Abstract


In the era of digital information proliferation, the potential disruptive impact on 
citizens, democracy, and society as a whole has raised concerns. To mitigate these 
risks, it becomes crucial to assess the authenticity of multimedia content and en- 
sure the dissemination of credible and truthful information. This doctoral thesis 
encompasses three distinct research pillars, addressing the challenges of multime- 
dia authentication and information leveraging through the development of novel 
approaches within the domains of spectral analysis, hypergraph theory, and con- 
volutional neural networks. The first part of this thesis focuses on an extensive in- 
vestigation of the Electric Network Frequency (ENF) criterion and its applications 
in multimedia authentication. A non-parametric approach for ENF estimation is 
proposed, incorporating a tailored lag window design into the Blackman-Tukey 
spectral estimation method. The problem of leakage reduction is formulated as an 
energy maximization task within the spectral window’s main lobe. A systematic 
study of non-parametric and parametric spectral estimation methods is conducted, 
adopting a frame-based approach and considering well-designed band-pass filters 
with appropriate band-pass edges and filter orders based on the nature of the 
recordings. To enhance the proposed approach, temporal windowing is introduced 
through the utilization of the filter-bank Capon spectral estimator. In light of 
the Toeplitz structure of the covariance matrix, a Gohberg-Semencul factoriza- 
tion technique is employed for matrix inversions, by means of Krylov matrices to 
expedite computations. Furthermore, this thesis addresses the challenges of ENF 
estimation in multimedia recordings, particularly the interference caused by speech 
content. Apart from audio recordings, an innovative automated approach is pro- 
posed for ENF estimation in both static and non-static digital video recordings. 
The varying textures, shadows, and luminance levels pose significant obstacles to 
ENF estimation in video, making it a non-trivial problem. The proposed approach 
overcomes these challenges by exploiting regions with similar characteristics in 
each video frame, referred to as superpixels. The second research pillar delves into 
hypergraph learning, specifically in the context of assessing and leveraging
multimodal information for faithful, personalized image and tag recommendation. To
address the computational complexities stemming from the extensive utilization
of multimodal information and the subsequent inversions of Laplacian or
adjacency matrices, randomized linear algebra is utilized as a means to overcome these
challenges. A novel approach called block randomized Singular Value Decom- 
position via subspace iteration is proposed and integrated within the process of 
hypergraph learning. Additionally, an efficient learning scheme is introduced for 
hypergraph ranking, utilizing multiple optimizations. This scheme dynamically 
optimizes the hypergraph structure through the incidence matrix and employs 
adaptive hyperedge weight estimation based on the gradient descent method. The 
optimized hypergraph ranking vectors offer personalized recommendations for im- 
ages of places of interest (POIs) or tags associated with POIs. The structural and 
weight optimizations of the proposed learning scheme are solved analytically from 
first principles, providing comprehensive derivations. To further enhance accuracy, 
a Least Mean Squares (LMS) approach is developed to adapt hyperedge weights, 
which is compared to the traditional closed-form method. The third research
pillar addresses the task of differentiating computer-generated images from natural 
ones. The proposed solution involves an end-to-end framework that utilizes super- 
vised contrastive learning and style transfer through deep neural networks. This 
framework enables discrimination by generating per-class embeddings and multi- 
ple training samples, even when initial data is limited. Hypothesis testing confirms 
the statistical significance of the demonstrated improvements achieved by the pro- 
posed framework. By extensively exploring and developing innovative approaches 
within these three pillars, this doctoral thesis contributes to the advancement of 
multimedia authentication, hypergraph learning, and computer-generated image 
discrimination, thereby addressing significant challenges in the digital information 
landscape. 
Abstract (translated from the Greek)


With the increase of information in the digital age, there is growing concern
about its possible effects on citizens, democracy, and society as a whole. To
address these risks, it is important to assess the authenticity of the
information contained in multimedia and to ensure the provision of reliable and
truthful information. This doctoral thesis spans three autonomous pillars that
examine the problem of verifying multimedia authenticity and leveraging
information through the development of approaches that fall within the fields of
spectral analysis, hypergraph theory, and convolutional neural networks. The
first part of this thesis presents a comprehensive investigation of the Electric
Network Frequency (ENF) criterion and its applications to multimedia content
authentication. The thesis examines the challenges that arise when estimating
the ENF in multimedia recordings because of the strong interference caused by
the frequency content of speech. A non-parametric approach to ENF estimation is
proposed, which designs and incorporates a lag window into the Blackman-Tukey
spectral estimation method. The reduction of spectral leakage is formulated as a
problem of maximizing the energy within the main lobe of the spectral window. In
addition, a systematic study of non-parametric and parametric spectral
estimation methods for computing the ENF is carried out with a frame-based
approach, taking into account a well-designed band-pass filter, its cut-off
frequencies, and the filter order, depending on the nature of the recordings.
The proposed approach is further extended by introducing temporal windowing
based on the Capon spectral estimator. The study incorporates a
Gohberg-Semencul factorization of the model's covariance matrix, owing to the
Toeplitz structure of that matrix. Krylov matrices are used for a fast
implementation of the matrix inversion. Beyond audio recordings, an automated
approach is proposed for ENF estimation in static and non-static digital video
recordings. The various textures and the different levels of shading and
brightness hamper ENF estimation in static and non-static video, making it a
non-trivial problem. The proposed approach relies on exploiting regions with
similar characteristics in each video frame, which are called superpixels. The
second part is devoted to hypergraph learning in the context of assessing and
leveraging multimodal information for image and tag recommendation. To mitigate
the high computational cost arising from inversions of the Laplacian or of the
adjacency matrix, randomized linear algebra is used. Specifically, an iterative
approach is proposed, called block-randomized Singular Value Decomposition,
which is incorporated into the process of estimating the hypergraph weights. In
addition, an efficient learning system is proposed for computing a faithful
ranking vector on hypergraphs, based on multiple optimizations. This system
dynamically optimizes the hypergraph structure through the incidence matrix and
uses adaptive estimation of the hyperedge weights based on the gradient descent
method. The optimized hypergraph ranking vectors recommend images of places of
interest (POIs) or image tags related to places of interest. The optimizations
of the hypergraph topology and of the hyperedge weights of the proposed system
are solved analytically, and the full mathematical derivations are provided. To
further increase accuracy, a Least Mean Squares (LMS) approach is developed for
adapting the hyperedge weights and is compared with the traditional closed-form
method. The third part addresses the problem of distinguishing
computer-generated images from natural images. The proposed solution for
differentiating computer-generated images (CGIs) from natural images (NIs)
comprises an end-to-end framework that uses supervised contrastive learning and
style transfer through deep neural networks. With per-class embeddings and the
generation of multiple training samples, the proposed framework enables
discrimination without requiring significant amounts of initial data.
Hypothesis testing confirms that the improvements achieved by the proposed
framework are statistically significant.
Extended abstract (translated from the Greek)


With the constant evolution of the digital age, the multitude of information in
circulation permeates every aspect of society, shaping the way we perceive and
interact with the world. Nevertheless, amid this endless volume of data, the
authenticity and integrity of multimedia content are becoming ever more critical
and demanding. The widespread dissemination of forged or misleading information
through multimedia poses challenges, shaking trust and threatening society
itself. The authentication of multimedia content is therefore a necessary
priority and direction in the field of computer science, which seeks to uphold
the truth, restore trust, and safeguard the integrity of digital communication.

In the first part of this thesis, the use of the Electric Network Frequency
(ENF) in multimedia content authentication applications is examined, analyzed,
and evaluated extensively, for both audio and video recordings. The ENF is found
at 60 Hz in the United States and at 50 Hz in Europe. The ENF values form a
non-periodic signal exhibiting small fluctuations that are due to the
instantaneous difference between the generated and the demanded electrical
power, from producers and consumers, respectively. The ENF fluctuations exhibit
no periodicity and therefore cannot be predicted. This property, together with
the ability of the ENF to remain stable within the same electrical network,
makes it an effective means of multimedia authentication. Computing and
extracting the ENF demands accuracy and efficiency, as it is often hard to
detect because of noise in speech signals, and because of motion, lighting,
textures, and the surrounding environment more generally in the case of video
recordings. Multimedia content authentication concerns alterations resulting
from the insertion of information at different time instants or from different
geographical regions. In the first part of this thesis, emphasis is placed on
parametric and non-parametric spectral analysis methods for the effective
extraction of the ENF, as well as on the preceding process, which plays a
catalytic role in the accurate extraction of the ENF from both audio and video
recordings.


• Case 1: Designing the appropriate lag window can reduce spectral leakage,
  improving the extraction of the ENF.


A large part of this research focuses on the study of parametric and
non-parametric spectral analysis methods for the optimal extraction of the ENF.
The customized design of a lag window is proposed, incorporated into the
Blackman-Tukey spectral analysis method, to limit the spectral leakage of the
main lobe. Spectral leakage is formulated as a problem of maximizing the energy
in the main lobe of the spectral window. The parameters of the lag window design
play a significant role in the final accuracy of ENF extraction and are studied
in order to show the extent to which ENF extraction is affected. The experiments
are carried out on two datasets with different noise levels, in accordance with
the literature. The proposed approach is compared with existing approaches from
the literature, and the statistical hypothesis testing carried out confirms that
the improvements achieved are statistically significant.


• Case 2: Reducing the computational complexity of the matrix inversion in the
  Capon method, by exploiting the Toeplitz structure of the covariance matrix
  and using the Gohberg-Semencul factorization in combination with appropriate
  temporal windows as short as one second, leads to high ENF extraction accuracy
  that exceeds the most recent methods in the literature.


The appropriate choice of window, irrespective of its size, in combination with
the preprocessing of the signal and the appropriate choice of parameters, leads
to effective ENF extraction. A key factor in the preprocessing of the signal
under examination is the cut-off frequencies of the band-pass filter used in the
experimental procedures. In addition, the order of the band-pass filter plays a
decisive role in the quality of the extracted ENF. The experiments showed that a
wrong choice can lead to an erroneous assessment of the authenticity of the
multimedia content. Taking into account the Toeplitz structure of the covariance
matrix, we proceeded to the Gohberg-Semencul factorization with the aim of
inverting this matrix faster, since the matrix grows as the signal under
examination grows. Exploiting the Krylov matrix, we arrive at a new mathematical
representation of the inverse of the covariance matrix, which we compute by
solving a system of linear equations with the Levinson-Durbin algorithm.
Systematic experiments demonstrate that ENF extraction with the proposed method,
combined with a Parzen temporal window whose length is reduced by 95% relative
to that of the literature, surpasses the most recent methods, exhibiting
statistically significant differences.


• Case 3: ENF extraction from video recordings can be achieved effectively
  without requiring the isolation of moving bodies and objects located in the
  foreground or the background.


It was recently shown that the ENF can be detected not only in audio recordings
but also in digital video recordings. These recordings present particular
difficulty for ENF extraction, especially when they include moving objects or
people. The difficulty escalates when the scene contains various textures,
different brightness levels, and shadows, as well as movements that occupy a
large percentage of each frame's area. In this part of the thesis, an automated
approach is proposed for ENF estimation from video recordings, using the Simple
Linear Iterative Clustering (SLIC) algorithm. With this algorithm, regions with
common characteristics, the superpixels, are created. The proposed approach
considers only the superpixels whose mean intensity is higher than a predefined
threshold. It is shown that within these regions the ENF is not affected by
interference, resulting in a more accurate estimate regardless of whether the
video is static or not. The proposed approach creates regions with similar
characteristics and computes the ENF only in these regions, in contrast to what
has been achieved to date in the literature, where algorithms that remove moving
bodies must be applied, significantly increasing computational complexity and
time. For the experimental procedures, recordings of escalating difficulty that
simulate security cameras were used. The improvements over the state-of-the-art
methods of the literature were tested and found to be statistically significant.


• Case 4: Multiple optimizations concerning the ranking vectors, the hypergraph
  topology, and the hyperedge weights can contribute to increasing the accuracy
  of personalized hypergraph-based information recommendation.


An efficient learning system for computing the ranking vector on hypergraphs,
based on multiple optimizations, is proposed for recommending places of
interest. This system dynamically optimizes the hypergraph structure through the
incidence matrix and uses adaptive estimation of the hyperedge weights based on
the gradient descent method. The optimized hypergraph ranking vectors recommend
images of places of interest or image tags related to places of interest. In
addition, a complementary semantic annotation system is developed and
integrated, which exploits the visual features of the images and is based on a
convolutional neural network. The hypergraph that constitutes the proposed
system captures and exploits semantic information among the nodes, such as the
tags associated with each image, the geographical coordinates of the depicted
place, and the visual features of each image. The performance of the proposed
personalized recommendation system depends, among other things, on various
parameters. The one with the largest contribution, which appears in the
optimization equation of the hypergraph topology, was examined more extensively
in order to show its degree of contribution. The optimizations of the ranking
vector, the hypergraph topology, and the hyperedge weights of the proposed
system are solved, and the full mathematical derivations are provided. To
further increase the accuracy of the system, one more approach is developed,
based on the Least Mean Squares (LMS) method, for adapting the hyperedge
weights, and it is compared with the traditional closed-form method.


• Case 5: Applying randomized linear algebra to the Singular Value
  Decomposition can reduce the computational complexity of the matrix inversion
  required to compute a faithful ranking vector on hypergraphs, while preserving
  accuracy.


In this part of the thesis, two approaches are presented that aim to reduce the
computational cost of optimizing the ranking vector of hypergraphs. Randomized
algorithms are used for low-rank matrix factorizations, so that the matrices are
decomposed approximately and then inverted. To address the high computational
cost arising from the inversions of the Laplacian or of the adjacency matrix, an
approach based on randomized linear algebra is used. Specifically, the
block-wise iterative randomized Singular Value Decomposition is proposed, which
is incorporated into the process of estimating the hypergraph ranking vector. By
creating rank-deficient blocks on the main diagonal, it allows the low-rank
submatrices to be inverted, reducing execution time through the block-wise
randomized Singular Value Decomposition.


• Case 6: Using a supervised contrastive learning framework in combination with
  style-transfer methods for augmenting the training samples leads to more
  effective discrimination of natural images from computer-generated ones.


Supervised contrastive learning is the basic approach to discriminating natural
images from computer-generated ones, and it is examined in the third part of
this thesis. More specifically, a framework is presented that is based on
convolutional neural networks for learning complex representations using
supervised contrastive learning. This framework is strengthened by a
complementary style-transfer module, which enhances the effectiveness of image
discrimination by adding extra training samples, even in cases where only part
of the initial dataset is used. This module operates in real time, applying the
style of specific computer-generated images to the content of natural images,
thereby producing new training samples. In the first stage, a convolutional
neural network based on the ResNet-18 architecture is trained using supervised
contrastive learning. In the second stage, the model trained in the first stage
is given as input to a linear classifier that uses the cross-entropy loss
function. In addition, after the end of each stage the proposed framework
applies stochastic weight averaging, thereby strengthening generalization and
contributing to the stability of the model.
Chapter 1


Doctoral study aspects 


In the ever-evolving digital era, the abundance of information permeates every 
aspect of our lives, shaping the way we perceive and interact with the world. How- 
ever, amidst this vast amount of data, the authenticity and integrity of multimedia 
content have become increasingly critical and demanding. The pervasive dissemi- 
nation of falsified or manipulated media poses formidable challenges, eroding trust 
and undermining our society. Hence, in this epoch of boundless connectivity, the 
authentication of multimedia content emerges as an imperative endeavour, a cru- 
cial frontier in the field of computer science, seeking to uphold truth, restore faith, 
and safeguard the integrity of digital communication. 

In the first part of this thesis, a comprehensive investigation, analysis, and 
evaluation are conducted regarding the utilization of Electric Network Frequency 
(ENF) in applications related to multimedia content authentication, encompassing 
audio and video recordings. The ENF has a nominal value of 60 Hz in the United
States and 50 Hz in Europe. The ENF
values represent a non-periodic signal exhibiting minor variations arising from the 
instantaneous disparity between electrical power generation and consumption by 
producers and consumers, respectively. These fluctuations lack periodicity, making 
their prediction unfeasible. This exceptional characteristic, combined with ENF’s 
capacity to remain stable and unaltered within the same interconnected network, 
endows it with effectiveness as a means of verifying the authenticity of multime- 
dia content. The computation and extraction of ENF necessitate precision and 
efficiency, as it becomes challenging to detect due to interference in speech sig- 
nals, as well as factors such as motion, lighting, textures, and the general ambient 
environment in the case of video recordings. Multimedia content authentication 
pertains to identifying alterations that result from the insertion of information at 
different timestamps or originating from diverse geographical locations. The pri- 
mary focus in the initial part of this thesis lies on parametric and non-parametric 
spectral analysis techniques, which facilitate the effective extraction of ENF.
Furthermore, emphasis is placed on the preprocessing stage that precedes it, which plays a catalytic role
in accurately extracting ENF from both audio and video recordings. 


• Case 1: Designing the appropriate lag window can reduce spectral leakage,
  improving the extraction of ENF.


A substantial portion of the doctoral research is devoted to the examination and 
advancement of parametric and non-parametric spectral analysis methodologies 
to achieve an optimal extraction of ENF. In Chapter 2, a customized design of
a lag window, integrated into the Blackman-Tukey spectral analysis technique, 
is proposed to mitigate spectral leakages originating from the main lobe. The 
mitigation of spectral leakage is formulated as an energy maximization problem 
within the main lobe of the spectral window. The parameters associated with 
the design of the lag window assume a pivotal role in determining the accuracy 
of ENF estimation, and their impact on the extraction of ENF is thoroughly 
investigated. The experiments are conducted on two distinct datasets featuring 
varying levels of noise, as reported in the literature. A comparative analysis is 
performed between the proposed approach and existing methodologies found in 
the literature. Statistical hypothesis testing is employed to affirm the statistical 
significance of the achieved improvements. 
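
The Blackman-Tukey estimate described above can be illustrated with a minimal
Python sketch. It is not the implementation used in the thesis: the triangular
(Bartlett) lag window is only a stand-in for the tailored lag window, and the
function names, the 400 Hz sampling rate, the number of lags, and the synthetic
50.02 Hz tone are all assumptions made for illustration.

```python
import math


def biased_autocorrelation(x, max_lag):
    """Biased sample autocorrelation r[k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]


def bartlett_lag_window(max_lag):
    """Triangular lag window (stand-in, not the thesis's tailored design)."""
    return [1.0 - k / (max_lag + 1) for k in range(max_lag + 1)]


def blackman_tukey_psd(x, fs, max_lag, freqs):
    """Blackman-Tukey spectral estimate at each frequency (Hz) in freqs."""
    r = biased_autocorrelation(x, max_lag)
    w = bartlett_lag_window(max_lag)
    psd = []
    for f in freqs:
        omega = 2.0 * math.pi * f / fs
        # r and w are even in the lag, so the two-sided sum folds into cosines.
        psd.append(w[0] * r[0] + 2.0 * sum(
            w[k] * r[k] * math.cos(omega * k) for k in range(1, max_lag + 1)))
    return psd


def estimate_enf(x, fs, max_lag, f_lo, f_hi, step):
    """Return the grid frequency in [f_lo, f_hi] at which the estimate peaks."""
    count = int(round((f_hi - f_lo) / step)) + 1
    freqs = [f_lo + i * step for i in range(count)]
    psd = blackman_tukey_psd(x, fs, max_lag, freqs)
    return freqs[max(range(count), key=psd.__getitem__)]


# Hypothetical frame: a 50.02 Hz tone sampled at 400 Hz for 5 seconds.
fs = 400.0
frame = [math.cos(2.0 * math.pi * 50.02 * n / fs) for n in range(2000)]
enf_hz = estimate_enf(frame, fs, max_lag=400, f_lo=49.5, f_hi=50.5, step=0.01)
```

In a frame-based setting, `estimate_enf` would be called once per frame, and a
different lag window can be tried by replacing `bartlett_lag_window` alone.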


• Case 2: Reducing the computational complexity of the matrix inversions in
  the filter-bank Capon spectral estimator by exploiting the Toeplitz structure
  of the covariance matrix and employing the Gohberg-Semencul factorization
  in combination with appropriate temporal windows can lead to accurate ENF
  estimation.


The accurate extraction of ENF necessitates careful consideration of various 
factors, such as the selection of an appropriate spectral window, irrespective of its 
size. Additionally, a preprocessing stage and the careful calibration of its parame- 
ters play a pivotal role in achieving precise ENF extraction. Of particular signifi- 
cance in the preprocessing stage is the determination of cut-off frequencies for the 
bandpass filter employed in the experimental procedures. Moreover, the choice of 
the bandpass filter’s order significantly impacts the quality of the extracted ENF. 
Empirical experiments have highlighted the detrimental consequences of incorrect 
parameter selection, leading to erroneous assessments of multimedia content au- 
thenticity. 

    In Chapter 3, to expedite the inversion of covariance matrices, which possess a
Toeplitz structure, the Gohberg-Semencul factorization technique is utilized, en- 
abling efficient matrix inversion. It is widely acknowledged that as signal length 
increases, computational requirements escalate accordingly. To address this chal- 
lenge, we leverage Krylov matrices to derive a novel mathematical representation 
of the inverse covariance matrix. The inverse is computed by solving a system of
linear equations using the Levinson-Durbin algorithm. 
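
To make the role of the Toeplitz structure concrete, the sketch below applies
the classical Gohberg-Semencul formula to a real, symmetric, positive definite
Toeplitz matrix. It does not reproduce the Krylov-based representation derived
in the thesis. The function names and the example lags are hypothetical, and
the code assumes that the lags define a positive definite matrix.

```python
def levinson_durbin(r):
    """Return (a, sigma2) for autocorrelation lags r[0..p], with a[0] == 1."""
    a = [1.0]
    sigma2 = r[0]
    for m in range(1, len(r)):
        # Reflection coefficient of order m.
        k = -sum(a[i] * r[m - i] for i in range(m)) / sigma2
        # Order update: a_new[i] = a[i] + k * a[m - i], with a[m] taken as 0.
        padded = a + [0.0]
        a = [padded[i] + k * padded[m - i] for i in range(m + 1)]
        sigma2 *= 1.0 - k * k
    return a, sigma2


def lower_toeplitz(first_column):
    """Lower-triangular Toeplitz matrix with the given first column."""
    n = len(first_column)
    return [[first_column[i - j] if i >= j else 0.0 for j in range(n)]
            for i in range(n)]


def gohberg_semencul_inverse(r):
    """Inverse of the symmetric Toeplitz matrix whose first row is r."""
    a, sigma2 = levinson_durbin(r)
    n = len(a)
    a1 = lower_toeplitz(a)                  # first column [1, a_1, ..., a_p]
    a2 = lower_toeplitz([0.0] + a[:0:-1])   # first column [0, a_p, ..., a_1]
    # Classical formula: R^{-1} = (A1 A1^T - A2 A2^T) / sigma2.
    return [[sum(a1[i][k] * a1[j][k] - a2[i][k] * a2[j][k] for k in range(n))
             / sigma2 for j in range(n)] for i in range(n)]


lags = [2.0, 1.0, 0.3, -0.2]   # hypothetical autocorrelation lags
inverse = gohberg_semencul_inverse(lags)
```

The recursion needs on the order of p^2 operations, whereas a general-purpose
inversion needs on the order of p^3. The dense inverse is formed here only so
that the result can be inspected.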

Systematic experimentation demonstrates the efficacy of our proposed ap- 
proach in conjunction with the Parzen temporal window, which is reduced in 
length by 95% compared to existing approaches documented in the literature. Our 
method surpasses recent methodologies, exhibiting statistically significant differ- 
ences in the derived ENF values. 


• Case 3: Efficient extraction of ENF from video recordings can be accomplished
  without the necessity of isolating and extracting moving entities and
  elements from either the foreground or the background.


Recent advancements have demonstrated the feasibility of detecting and ex- 
tracting ENF from digital video recordings in addition to audio recordings. How- 
ever, the extraction of ENF from video recordings presents inherent challenges, 
primarily due to the presence of moving objects or individuals within the scenes. 
These challenges are further compounded when the scenes encompass diverse tex- 
tures, variations in brightness, shadows, and substantial movements that occupy 
a significant proportion of each frame. 

    In Chapter 4, we propose an automated approach for estimating ENF from
video recordings by employing the Simple Linear Iterative Clustering (SLIC) al- 
gorithm. The proposed approach generates superpixels, which are regions with 
common characteristics, to facilitate the estimation process. Specifically, only su- 
perpixels with an average intensity exceeding a predefined threshold are considered, 
ensuring that regions unaffected by interference contribute to a more accurate ENF 
estimation, irrespective of the video’s static or dynamic nature. Notably, the pro- 
posed approach generates regions with similar characteristics and computes ENF 
solely within these regions. This deviates from existing literature, where back- 
ground and foreground subtraction algorithms are typically employed, resulting in 
increased computational complexity and time requirements. 
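
The superpixel selection step can be sketched as follows. This is an
illustration under stated assumptions, not the thesis's pipeline. The label map
is assumed to come from a SLIC implementation and is hard-coded here. The
function name, the threshold value, and the choice to average the retained
pixels into one sample per frame are all hypothetical.

```python
def bright_superpixel_mean(frame, labels, threshold):
    """Mean intensity over superpixels whose own mean exceeds the threshold.

    frame and labels are equally sized 2-D lists; labels[i][j] is the
    superpixel index of pixel (i, j). Returns None if nothing is retained.
    """
    sums, counts = {}, {}
    for pixel_row, label_row in zip(frame, labels):
        for value, label in zip(pixel_row, label_row):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
    # Keep only superpixels with an average intensity above the threshold.
    kept = [label for label in sums if sums[label] / counts[label] > threshold]
    if not kept:
        return None
    return sum(sums[l] for l in kept) / sum(counts[l] for l in kept)


# Hypothetical 2 x 4 grayscale frame split into two superpixels (0 and 1).
frame = [[10, 20, 200, 220],
         [30, 20, 210, 230]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
sample = bright_superpixel_mean(frame, labels, threshold=100)
```

Applying the function to every frame would give one intensity sample per frame,
and the resulting time series could then be passed to a spectral estimator.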

Regarding the experimental aspects, we employed recordings simulating secu- 
rity cameras that progressively increased in difficulty. Statistical analysis con- 
firmed the statistically significant improvements achieved by our proposed ap- 
proach over state-of-the-art methodologies documented in the literature. 


• Case 4: By incorporating various optimizations pertaining to the ranking
  vector, the hypergraph’s topology, and the weights of the hyperedges, it is
  possible to enhance the accuracy of hypergraph-based information
  recommendation.


    In Chapter 5, we propose an efficient adaptive learning framework for
computing the ranking vector in hypergraphs, specifically tailored for the personalized
recommendation of places of interest (POIs). The system employs multiple
optimization techniques to dynamically optimize the hypergraph’s structure through
the incidence matrix. Furthermore, an adaptive estimation approach based on the 
gradient descent method is utilized to optimize the hyperedge weights. 
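
For orientation, a formulation of hypergraph ranking that is widely used in the
literature is sketched below. It is an assumption that the framework builds on
this form, and the symbols are introduced here for illustration only: H is the
incidence matrix, W the diagonal matrix of hyperedge weights, D_v and D_e the
vertex and hyperedge degree matrices, y the query vector, and \mu a
regularization parameter.

```latex
\begin{align}
  A &= D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2}, \qquad L = I - A, \\
  Q(f) &= f^{\top} L f + \mu \lVert f - y \rVert_2^2, \\
  f^{\star} &= \frac{\mu}{1+\mu} \Bigl( I - \frac{1}{1+\mu} A \Bigr)^{-1} y .
\end{align}
```

The last line follows from setting the gradient of Q to zero, which gives
(L + \mu I) f = \mu y. Under this formulation, the ranking vector requires a
matrix inverse whose size equals the number of vertices.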

The optimized ranking vectors obtained from the hypergraph provide valuable 
insights into images associated with POIs or image labels that are relevant for 
recommendation purposes. Additionally, a complementary system for semantic 
annotation is developed and integrated into the framework. This system leverages 
visual features extracted from the images using a convolutional neural network 
(CNN). 

The proposed hypergraph-based system captures and exploits semantic infor- 
mation among nodes, incorporating tags associated with each image, geographic 
coordinates of depicted POIs, and the visual characteristics of the images. The 
performance evaluation of the proposed recommendation system encompasses vari- 
ous parameters, with particular attention given to the optimization equations and 
the hypergraph’s topology, in order to analyze their individual contributions in 
detail. Detailed mathematical analyses are provided for the optimizations of the 
gradient vector, the hypergraph’s topology, and the hyperedge weights within the 
proposed system. 

To further enhance the accuracy of the system, we introduce an alternative 
approach based on the Least Mean Square (LMS) method. This approach aims 
to adjust the weights of the hyperedges and is compared to the traditional closed- 
form method. By employing the LMS method, we strive to improve the system’s 
performance and refine the accuracy of the ranking vector computation. 

Through this proposed adaptive learning framework for hypergraph-based rank- 
ing computation, tailored for the faithful and personalized recommendation of 
POIs, we address the need for efficient and accurate systems. The mathematical 
analyses presented offer insights into the optimizations employed, shedding light 
on the system’s performance characteristics. 


• Case 5: The utilization of randomized linear algebra in Singular Value
Decomposition (SVD) in adaptive hypergraph weight estimation allows for a 
reduction in the computational complexity associated with matrix inversion 
for deriving the ranking vector, while simultaneously upholding accuracy. 


In Chapter 6, we present two distinct approaches aimed at mitigating the
computational burden associated with optimizing the ranking vector in hypergraphs.
The ranking vector should provide faithful, personalized recommendations. The 
objective is to achieve efficient solutions while reducing the computational cost. 
To this end, randomized algorithms are employed to perform low-rank matrix 
factorizations, enabling an approximate decomposition that can subsequently be 
inverted. 

To tackle the challenge of high computational complexity resulting from
inversions of Laplacian or adjacency matrices, we propose a novel approach based on
randomized linear algebra techniques. Specifically, we integrate block randomized 
SVD via subspace iteration within adaptive hypergraph weight estimation for im- 
age tagging. By leveraging tessellation to create low-rank submatrices along the 
main diagonal, fast matrix inversions can be achieved through randomized SVD. 
This reduction in runtime significantly enhances computational efficiency while 
maintaining comparable levels of accuracy. 
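
As an illustration of the idea described above, the following is a minimal sketch of a randomized SVD with subspace iteration and of the low-rank approximate inverse it yields. It is a generic textbook-style variant, not the block-wise, tessellation-based scheme of Chapter 6; the function names `randomized_svd` and `approx_pinv`, the oversampling amount, and the number of iterations are assumptions made for this example only.

```python
import numpy as np


def randomized_svd(A, rank, n_oversample=10, n_iter=2, seed=0):
    """Approximate rank-`rank` SVD of A via randomized subspace iteration."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Random test matrix; oversampling helps capture the dominant range of A.
    omega = rng.standard_normal((n_cols, rank + n_oversample))
    Q, _ = np.linalg.qr(A @ omega)
    # Subspace iterations, re-orthonormalised at every step for stability.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix, then map back to the full space.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank, :]


def approx_pinv(A, rank, **kwargs):
    """Low-rank approximate pseudo-inverse V diag(1/s) U^T (hypothetical helper)."""
    U, s, Vt = randomized_svd(A, rank, **kwargs)
    return (Vt.T / s) @ U.T
```

The cost is dominated by a few products of the matrix with thin matrices, which is where the runtime reduction over a full decomposition would come from when the target rank is small.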

The utilization of randomized algorithms for low-rank matrix factorizations in 
the context of hypergraphs presents a promising avenue for optimizing the ranking 
vector while alleviating the computational cost. The proposed approach, integrat- 
ing block randomized SVD via subspace iteration and adaptive hypergraph weight 
estimation, provides a computationally efficient solution for image tagging appli- 
cations. Through the creation of low-rank submatrices using tessellation, rapid 
matrix inversions via randomized SVD are achieved, resulting in a notable reduc- 
tion in runtime without compromising accuracy. These advancements contribute 
to the development of efficient techniques for ranking optimization in hypergraphs, 
addressing the computational challenges associated with large-scale datasets. 


• Case 6: Employing a supervised contrastive learning framework in conjunction
with a style transfer module to augment the training samples can result
in enhanced discrimination between natural images and computer-generated 
images, thereby improving efficiency. 


The third part of this thesis extensively investigates supervised contrastive 
learning as the primary approach for effectively discerning natural images (NIs) 
from computer-generated images (CGIs). To this end, in Chapter 7, we propose a
comprehensive framework based on CNNs that facilitates the learning of intricate 
representations through supervised contrastive learning. 

Enhancing the framework, we introduce a complementary style transfer module 
that significantly improves the efficiency of image discrimination. This module en- 
riches the training process by incorporating additional samples, even when only a 
subset of the original dataset is available. Operating in real-time, the style transfer 
module aligns the content manifold with the style manifold, further augmenting 
the discrimination capabilities of the model. In the initial stage, a CNN based on 
the ResNet-18 architecture is trained using supervised contrastive learning. Sub- 
sequently, in the second stage, the trained model from the first stage serves as 
input to a linear classifier, employing a cross-entropy loss function. Moreover, to 
enhance the overall generalization and stability of the model, the proposed frame- 
work incorporates stochastic weight averaging at the completion of each stage. 

By presenting this framework for supervised contrastive learning, along with
the complementary style transfer module, we address the challenge of
distinguishing between natural and computer-generated images. The application of CNNs
and the utilization of stochastic weight averaging contribute to the effectiveness 
and robustness of the model, facilitating accurate discrimination and improving 
overall performance. 

The conceptual idea of the overall research roadmap utilized in this doctoral 
dissertation is summarized as follows: 

The initial significant milestone in our research endeavors encompassed an in- 
depth exploration of multimedia authentication using ENF, focusing specifically on 
the examination of parametric and non-parametric spectral estimation methods. 
The objective was to identify crucial parameters for accurate ENF estimation and 
surpass the existing standard procedures prevalent in the literature. To achieve 
this, we critically analyzed the conventional rectangular window and delved into 
a diverse range of temporal and lag spectral windows to enhance the precision of 
ENF estimation. Through the integration of a customized lag window within the 
Blackman-Tukey spectral estimator, we challenged the existing approaches and 
achieved superior accuracy in ENF estimation. Furthermore, we introduced a fast 
implementation of the filter-bank Capon estimator, accompanied by effective filtering and
preprocessing techniques, which effectively mitigated interference and facilitated 
ENF estimation. 

Expanding beyond digital audio recordings, we also addressed the growing in- 
terest in multimedia authentication applications involving digital video recordings. 
Our subsequent milestone involved the transition from ENF estimation in audio 
recordings to video recordings in both static and non-static environments. Going 
beyond the existing methods in the literature that rely on background subtrac- 
tion techniques, we proposed an automatic approach that disregards the static or 
non-static nature of video recordings. By assuming that the luminance and its 
fluctuations from fluorescent lamps can be embedded in objects, whether static 
or non-static, in a frame-based manner, our proposed approach utilized the SLIC 
algorithm to generate regions with common characteristics. This mechanism en- 
sured highly accurate and automated ENF estimation in both static and non-static 
digital video recordings. 

Driven by the ability of hypergraphs to model complex and high-dimensional 
relationships among entities, we introduced an efficient hypergraph learning frame- 
work to address the challenge of faithful, personalized image and tag recommen- 
dation in touristic scenarios. This framework employed multiple optimizations to 
derive accurate recommendations. Specifically, we optimized the topology of the 
hypergraph, represented by the incidence matrix, as well as the weights of the 
hyperedges, and the ranking vector. Additionally, we made a noteworthy
contribution by analytically deriving the optimization functions, which can guide further
advancements in the proposed framework by the research community. 

Lastly, in response to the surge of forgery incidents in multimedia content, 
particularly in CGIs, we presented a CNN-based framework to tackle the broader 
issue of multimedia disinformation. Our proposed framework aimed to distinguish 
NIs from CGIs by incorporating supervised contrastive learning within a CNN 
based on the ResNet-18 architecture. Furthermore, we incorporated a style transfer 
module to enable accurate discrimination even in real-world scenarios where the 
availability of training samples is limited. The proposed framework demonstrated 
highly accurate results, outperforming state-of-the-art techniques and showcasing 
statistically significant improvements. 

Chapter 2


Blackman-Tukey spectral
estimation and electric network
frequency matching from power
mains and speech recordings


2.1 Introduction 


Multimedia content has attained pervasive presence across diverse domains of ev- 
eryday existence, encompassing content that is publicly accessible as well as con- 
tent of a strictly private nature. The widespread dissemination of information has 
precipitated a notable upsurge in the manipulation and adulteration of multime- 
dia content. Although there is an undeniable proliferation of multimedia forgery, 
substantial endeavors have been devoted to the development of effective
countermeasures aimed at detecting and thwarting such illicit activities [1], [2].

Multimedia forensics assumes a crucial role in the extraction of valuable evi- 
dentiary information from multimedia content, serving as a powerful tool in crime 
investigations. However, the task of ensuring the authentication and integrity 
validation of multimedia content has become progressively more challenging [3].
Furthermore, with the extensive adoption of cloud computing, this technology has 
become an attractive target for malicious activities. Consequently, substantial 
attention has been dedicated to the development of effective and efficient tools 
tailored for the domain of cloud forensics [4], [5].

Audio and image forensics encompass a broad spectrum of applications, includ- 
ing fraud detection, determining the recording’s origin, content authentication, and 
detecting edits. Despite notable advancements in digital forensics, the continuous 
proliferation of systems capable of editing audio recordings imperceptibly poses a 
persistent threat. Image forensics, on the other hand, focuses on developing tools to
determine whether visual content has been modified or altered. Consequently,
significant efforts have been dedicated to enhancing the effectiveness of such
tools [6], [7], [8], [9]. Copy-move attacks are very common in the field
of image forensics, and considerable effort has been devoted to developing robust
techniques for their detection [10], [7], [12]. A robust method employing the
Discrete Cosine Transform and Singular Value Decomposition to detect this kind of
attack was presented in [13]. Robust scale and translation invariant features were
employed to detect copy-move attacks. A block-based method employing the
Fourier-Mellin Transformation for copy-move forgery detection was proposed in [14].
Watermarking has been used for image tamper detection. Many approaches based on
watermarking have been proposed in order to determine whether an image is
authentic or has been modified [78].

Audio forensics are intensively studied [22]. Audio reverberation
significantly affects the quality of an audio recording and small differences can 
indicate alterations in the original recordings. A forensic tool was presented in 
[23] that models reverberation. Background noise can be employed in order to 
determine the integrity of an audio recording, but speech leakage is evident. A 
novel framework was introduced for background noise estimation [24]. Splicing 
detection in audio recordings constitutes a major scenario in audio tampering 
detection. CNNs are able to derive representative features of an audio recording. 
A novel method was proposed in [25] for splicing detection that employs a CNN in
order to derive such features using the spectrogram of audio recordings. 

This Chapter focuses on the design of a lag window specifically tailored for 
the non-parametric Blackman-Tukey spectral estimator. The aim is to effectively 
detect ENF signals in power mains recordings as well as speech recordings. The 
developed window strikes a balance between minimizing smearing and leakage to 
ensure accurate detection of ENF, which typically possesses weaker power com- 
pared to other nearby frequency components such as pitch or the first formant 
of vowels. By incorporating the designed window into the BT method, superior 
detection performance is achieved in both power mains and speech recordings 
compared to existing state-of-the-art approaches. It is important to note that the 
proposed approach, referred to as BT+Lag Window Design (LWD), undergoes 
rigorous evaluation using real-world datasets, following the established practices 
described in the literature. Two evaluation metrics, namely the Maximum
Correlation Coefficient (MCC) and the mean Squared Deviation Error (mSDE) between
the estimated and reference ENF signals, are utilized to assess the accuracy of ENF 
estimation. Additionally, hypothesis testing is conducted to determine the statis- 
tical significance of the improvements in ENF estimation accuracy, as measured by 
MCC and mSDE, achieved by the proposed approach compared to state-of-the-art 
methods. It is crucial to emphasize that this chapter addresses an estimation
problem wherein ENF is extracted using spectral estimation methods when a reference
ENF signal is available. The evaluation of ENF estimation quality is performed 
through matching, as commonly practiced in related literature. 

In summary, the main contributions of this Chapter are as follows: 


• A Blackman-Tukey approach is proposed, which employs a custom-designed
lag window for efficiently estimating the power spectrum of the ENF
component in recordings from power mains and speech recordings.

• A systematic study is conducted regarding parameter selection in designing
the proposed lag window, enabling accurate ENF estimation. Proper selection
of parameters reduces leakage, which affects ENF estimation.

• Experiments are conducted on real-world benchmark datasets and extensive
comparisons with state-of-the-art methods are made.

• Emphasis is put on the details of band-pass (BP) filtering of the raw signal prior
to spectral analysis and on the fine tuning of the parameters involved in spectral
analysis techniques, enabling the report of more accurate results than those
disclosed in the relevant literature.

• In addition to existing matching procedures between the extracted ENF
time series and the ground truth one of equal length, efficient dynamic time
warping (DTW) is employed and assessed, which allows the time series to
have different lengths and eliminates the need to downsample the ground
truth ENF signal, as is tacitly assumed in the literature.

• Hypothesis testing asserts that the improvements in ENF estimation
accuracy with respect to (w.r.t.) MCC and mSDE between the proposed
approach and those disclosed in literature are statistically significant.
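
To make the DTW contribution listed above more concrete, the following is a minimal sketch of the classic dynamic-programming form of DTW between two ENF time series of different lengths. It is the plain quadratic-time recursion, not the efficient variant assessed in this chapter; the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Accumulated DTW cost between 1-D sequences a and b (lengths may differ)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # D[i, j] holds the minimum accumulated cost of aligning a[:i] with b[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warping path may repeat samples of either series, a reference ENF sampled at a higher rate can be compared with the estimated one directly, without downsampling.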


2.2 Related work 


A non-parametric iterative adaptive approach (IAA) for spectral estimation was
proposed in [26]. Furthermore, a novel method for frequency estimation based
on dynamic programming, seeking the minimum cost path among per-frame
frequencies, was also introduced in [26]. Another approach employing a
Maximum-Likelihood estimator (MLE) for multi-tone and single-tone signals was
proposed in [27]. In particular, the Cramer-Rao bound was used to estimate the
variance of the ENF estimator. ENF is present at multiple harmonics of the nominal
frequency. A method weighting ENF with the local signal-to-noise ratio (SNR) of
each harmonic was described in [28]. An ENF estimation algorithm based on the
Discrete Fourier Transform (DFT) was proposed in [29]. There, a binary approach
was used to search for specific spectral lines instead of the entire frequency band.
Interference in speech signals hinders ENF estimation. To cope with that problem,
robust principal component analysis for noise reduction and weighted linear
prediction for ENF estimation were introduced in [30]. Fine tuning of signal
filtering and parametrization for ENF estimation in recordings from power mains
and speech recordings can increase ENF estimation accuracy. In this context, a
systematic study of various techniques in combination with proper parametrization
was conducted in [31]. An ENF estimation method in audio recordings employing
frequency demodulation was proposed in [32]. Different noise conditions were
established in order to test different methods. A variety of parametric and
non-parametric frequency estimation methods were employed for high precision ENF
estimation. A method based on instantaneous frequency estimation using the Hilbert
transform was proposed in [34]. Time requirements and estimation accuracy crucially
affect ENF real-world applications. Window selection can significantly improve
accuracy without affecting time requirements at all. A novel accurate approach
employing a fast Capon-based spectral estimator after applying a temporal Parzen
window was proposed in [35]. In most cases, the ENF signal in audio recordings
suffers from strong interference. Filtering may enhance the ENF signal in such cases.
A filtering algorithm was proposed in [36], which employs a kernel function to
create a time-frequency representation facilitating ENF estimation. A novel ENF
estimation scheme was proposed in [37], which employs Least Absolute Deviation
regression for finding the regression weights and minimization of objective
functions w.r.t. frequency in an alternating way. The existence of a reference ENF is
of high importance in multimedia authentication tasks. A method for creating a
reliable ENF database employing multiple frequency sensors was detailed in [38].
An automated general scheme for ENF estimation was proposed in [39].

Machine learning algorithms can indicate the region where the recordings were 
captured [40]. A multi-label machine learning approach exploiting various ENF 
signal features to identify the region-of-recordings was examined [41]. CNNs can 
be exploited in order to learn features emerging from ENF audio recordings. A 
CNN-based system using spectrograms for audio recapture detection was proposed 
in [42]. 

ENF estimation accuracy in audio recordings is intuitively based on several 
stochastic and deterministic factors. The first comprehensive study on the nature 
of factors affecting ENF capture was conducted in [43]. Malicious attacks in power 
grids can be prevented when future ENF is known. Two algorithms were proposed, 
employing correlation kernel regression and autoregressive moving average for ENF 
forecasting [44]. 

For completeness, we refer to tamper detection. ENF variations can be
exploited to facilitate edit detection in multimedia recordings, as proposed in [45].
Phase discontinuities are created due to insertions and deletions. Phase change
analysis via high-precision Fourier analysis was adopted to verify the authenticity
of multimedia recordings [46]. A method employing estimation of signal
parameters via rotational invariance techniques (ESPRIT) and exploiting kurtosis
for detecting abnormalities in ENF variations was suggested in [47]. Time is
critical in detecting multimedia alterations. A support vector machine (SVM)-based
framework for automatic detection of such alterations based on the kurtosis of
ENF disturbances was introduced in [48]. Features of the ENF signal were extracted
and utilized for digital audio authentication. An SVM-based framework for automatic
tamper detection with feature fusion was proposed in order to overcome
difficulties in visual tamper inspection [49]. Finding edited areas constitutes a
significant issue in digital forgery detection. A method based on the maximum offset
in cross-correlation between the estimated ENF and the reference one for edit
detection was introduced in [50]. An algorithm based on inter-frame video forgery
detection for frame deletion, duplication, and insertion was proposed in [51].
Timestamp verification and tamper detection were integrated within an audio
verification system [52]. A measurement method called Absolute-Error-Maps was
utilized, based on absolute errors between the extracted ENF and the reference
signal. ENF can be exploited as a fingerprint for audio and video synchronization
applications by aligning the embedded ENF signals. A scheme for multimedia
synchronization based on ENF was proposed in [53].


2.3 Electric Network Frequency overview


2.3.1 The ENF criterion 


The ENF criterion, originally introduced by Catalin Grigoras [54], [55], has emerged
as a groundbreaking tool in the field of forensics. It serves to verify the authenticity 
of digital recordings, determine their time of capture, and provide insights into the 
geographical location where they were recorded. The ENF signal is inherently 
embedded in digital audio and indoor video recordings, and the development of 
estimation techniques aims to extract this signal optimally. The ENF is present 
not only at its nominal frequency but also at its harmonics. However, the accurate 
estimation of ENF is impeded by various interferences and challenging low SNR 
conditions. Consequently, numerous approaches have been proposed to address 
these difficulties and effectively extract the ENF signal [39]. 

The ENF signal exhibits instantaneous deviations from the nominal frequency, 
which is set at 50 Hz in Europe and 60 Hz in the US [54]. These fluctuations
arise due to the stochastic differences between the demanded power and the power
actually generated, resulting in continuous variations in the rotational speed of 
energy generators within power plants. To capture the ENF signal, specialized 
sensors known as frequency disturbance recorders (FDR) are employed, offering 
precise measurements with an accuracy of up to $\pm 5 \cdot 10^{-4}$ Hz [26]. The ENF signal
possesses several notable properties, including the following: 


• The ENF signal exhibits random fluctuations around its nominal value.

• These fluctuations are consistent within the same power network.

• In addition to the nominal frequency, the ENF signal is also present in higher
harmonics.


Accurate methods should be developed in order to isolate and extract the 
embedded ENF traces and determine the integrity of multimedia content. Having 
extracted the ENF signal, a comparison against a reference ENF database should 
be made. 

In Europe, ENF is defined as $f = [50 + \Delta f_g]$ Hz, where $\Delta f_g$ denotes the
ENF fluctuations around the nominal frequency [54]. Based on the ENF fluctuations
$\Delta f_g$, there are three different conditions, indicating the proper function of
network operation [59]:

• If $\Delta f_g < 20$ mHz, the network is operating properly.

• If $20\ \text{mHz} < \Delta f_g < 200\ \text{mHz}$, ENF fluctuations exceed normal limits, but
there is no danger for network operation.

• If $\Delta f_g > 200$ mHz, the electric network is at major risk.
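
The three operating conditions listed above can be expressed as a small helper. This is only an illustrative sketch: the function name and the returned labels are hypothetical, the magnitude of the deviation is used, and the treatment of the exact boundary values (20 mHz and 200 mHz), which the text leaves unspecified, is an assumption.

```python
def classify_enf_deviation(delta_f_hz):
    """Map an ENF deviation (in Hz) from the nominal frequency to a network state."""
    magnitude = abs(delta_f_hz)
    if magnitude < 0.020:       # below 20 mHz
        return "normal operation"
    if magnitude <= 0.200:      # assumed: boundary values fall in the middle class
        return "outside normal limits, no danger"
    return "major risk"         # above 200 mHz
```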


Many factors should be taken into consideration to efficiently estimate ENF. 
Depending on the nature and properties of ENF signal, three main categories of 
ENF estimation methods can be established based on the analysis domain [55]: 


• Time/Frequency: A visual comparison is made between the spectrogram and
the reference signal. Usually, it is employed for short-time recordings.

• Frequency: The periodogram is computed and the frequency associated with
its maximum magnitude (i.e., spectral peak) is estimated over short time
segments. The corresponding frequency is compared against the reference ENF
signal. This approach exploits various spectral estimation methods. A sharp
bandpass zero-phase FIR filter should be applied on the raw signal, prior to
spectral estimation.

• Time: Having applied a sharp bandpass zero-phase FIR filter to the raw
signal, zero-crossing measurements around the frequency of interest (i.e.,
fundamental frequency or its harmonics) are conducted.


2.3.2 Accuracy of ENF estimation 


After ENF estimation, a matching procedure is applied in order to objectively
assess estimation accuracy. Having a ground truth (reference) dataset, two
metrics, namely, the MCC and the mSDE [60], are used to compare the
extracted frequencies against the ground truth ones. Using the notation
introduced in [26], let $\mathbf{f} = [f_1, f_2, \ldots, f_K]^T$ be the estimated ENF signal at
each second. Let also $\mathbf{g} = [g_1, g_2, \ldots, g_{\tilde{K}}]^T$ for $\tilde{K} > K$ be the reference
ground truth ENF and $\mathbf{g}(l) = [g_l, g_{l+1}, \ldots, g_{l+K-1}]^T$ be a segment of $\mathbf{g}$
starting at $l$. The following index is determined:

$$l_{opt} = \arg\max_{l}\, c(l) \tag{2.1}$$

where $l = 1, 2, \ldots, \tilde{K}-K+1$ and $c(l)$ is the sample correlation coefficient between
$\mathbf{f}$ and $\mathbf{g}(l)$ defined as:

$$c(l) = \frac{\mathbf{f}^T \mathbf{g}(l)}{\|\mathbf{f}\|_2 \, \|\mathbf{g}(l)\|_2}. \tag{2.2}$$

For the second metric, mSDE, the best index is found by minimizing the
squared error between $\mathbf{f}$ and $\mathbf{g}(l)$, i.e.,

$$l_{opt} = \arg\min_{l = 1, \ldots, \tilde{K}-K+1} \|\mathbf{f} - \mathbf{g}(l)\|_2^2. \tag{2.3}$$
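
The matching procedure of (2.1)-(2.3) can be sketched as an exhaustive search over all admissible starting positions of the reference segment. The function name `match_enf` is hypothetical, and lags are 0-based in the code, whereas $l$ starts at 1 in the text.

```python
import numpy as np


def match_enf(f, g):
    """Slide f along the longer reference g; return the best lags per (2.1) and (2.3)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    K = len(f)
    n_lags = len(g) - K + 1                 # number of admissible segments of g
    corr = np.empty(n_lags)
    sq_err = np.empty(n_lags)
    for l in range(n_lags):
        seg = g[l:l + K]                    # segment g(l) of length K
        corr[l] = f @ seg / (np.linalg.norm(f) * np.linalg.norm(seg))   # eq. (2.2)
        sq_err[l] = np.sum((f - seg) ** 2)                              # cost in (2.3)
    l_corr = int(np.argmax(corr))           # eq. (2.1)
    l_err = int(np.argmin(sq_err))          # eq. (2.3)
    return l_corr, corr[l_corr], l_err, sq_err[l_err]
```

When the estimated signal is an exact excerpt of the reference, both criteria should return the same lag, with a correlation of 1 and a squared error of 0.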


2.4 Dataset and estimation procedure 


The presence of ENF signals is known to be significantly influenced by the record- 
ing environment. Additionally, the performance of ENF estimation approaches 
may vary depending on the specific recording conditions. Therefore, it is essential 
to evaluate ENF estimation methods using diverse datasets encompassing var- 
ious environmental conditions and SNRs. In this study, we utilize two bench- 
mark datasets obtained from the University of Florida’s repository (available at 
www.sal.ufl.edu/download.html) to assess the effectiveness of the proposed ap- 
proach. These datasets consist of real-valued signals and have been previously 
discussed in [26] and employed in subsequent works such as [27], [30], [31], [35].

The first dataset, referred to as Data 1, was captured by directly connecting 
an electric outlet to the internal sound card of a desktop computer using a voltage 
divider. This dataset exhibits a high SNR, indicating a strong signal presence. On
the other hand, the second dataset, known as Data 2, comprises speech recordings 
obtained from the internal microphone of a laptop computer and exhibits a low 
SNR, indicating the presence of significant interference. The original datasets were 
sampled at 44.1kHz with a resolution of 16 bits per sample. Prior to analysis, 
the recordings were downsampled to 441 Hz, using appropriate anti-aliasing filters. 
Consequently, apart from the fundamental frequency at 60 Hz, higher harmonics 
can be exploited for ENF estimation. It is worth noting that both datasets should 
contain identical ENF signals since they were captured simultaneously within the 
same interconnected network. Furthermore, a reference ground truth dataset con- 
taining the actual ENF signal was recorded using an FDR for comparison against 
the estimated signals obtained from Data 1 and Data 2. 

The initial step in ENF estimation involves the application of a bandpass filter 
to the raw signal, targeting the frequencies of interest. Since ENF is present 
at the fundamental frequency and its harmonics, the same filtering procedure is 
performed for each harmonic. Proper selection of the bandpass edges and filter 
order during the filtering process has been shown to significantly enhance ENF 
estimation accuracy, as demonstrated in [31]. For Data 1, the 1st, 2nd and 3rd
harmonics were utilized to ensure consistency with the existing literature. A sharp 
zero-phase Finite Impulse Response (FIR) filter with a bandpass edge at 59.9 Hz 
and 60.1 Hz and a filter order of 1501 was applied around the first harmonic. 
Similarly, for the 2nd harmonic, the bandpass edges were set at 119.98 Hz and 
120.02 Hz, while maintaining the same filter order as the first harmonic. For the 
3rd harmonic, the bandpass edges were set at 179.9 Hz and 180.1 Hz, and the 
filter order was reduced to 1001. In all cases, a Hamming window with a length 
equal to the filter order was employed. 
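
The filtering step described above can be sketched with a windowed-sinc design. This is an illustrative approximation under stated assumptions, not the exact filter used in the experiments: the filter order is interpreted as the (odd) number of taps, the gain is normalised to one at the band centre, and zero phase is obtained by centring the symmetric impulse response on each output sample. The function names are hypothetical.

```python
import numpy as np


def design_bandpass_fir(numtaps, f_lo, f_hi, fs):
    """Hamming-windowed linear-phase band-pass FIR with edges f_lo, f_hi (Hz)."""
    n = np.arange(numtaps) - (numtaps - 1) / 2.0
    # Ideal band-pass = difference of two ideal low-pass (sinc) responses.
    h = (2 * f_hi / fs) * np.sinc(2 * f_hi * n / fs) \
        - (2 * f_lo / fs) * np.sinc(2 * f_lo * n / fs)
    h *= np.hamming(numtaps)
    # Normalise to unit gain at the band centre (assumption of this sketch).
    fc = 0.5 * (f_lo + f_hi)
    gain = np.abs(np.sum(h * np.exp(-2j * np.pi * fc * n / fs)))
    return h / gain


def zero_phase_filter(x, h):
    """Apply a symmetric odd-length FIR centred on each sample (no delay)."""
    return np.convolve(x, h, mode="same")
```

For example, `design_bandpass_fir(1501, 59.9, 60.1, 441.0)` corresponds to the setting quoted for the first harmonic of Data 1; samples within about half the filter length of either end of the signal are affected by edge effects.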

For Data 2, only the 2nd harmonic was utilized in the experiments, as stated 
in the literature, since the other harmonics suffered from extremely low SNR. A 
sharp zero-phase FIR filter was applied around the second harmonic of the speech 
recordings, with bandpass edges set at 119.9 Hz and 120.1 Hz, while the filter
order was set at 4801. It is worth mentioning that the selection of the filter order
critically affects ENF estimation accuracy. A different filter order needs to be
defined at each harmonic to obtain the best results. As we increased the filter
order, the corresponding transition width was reduced, yielding more accurate
results. As a matter of fact, for the low-SNR Data 2, the filter order was set at
4801, while for the high-SNR Data 1 the filter order was set at 1501 or 1001,
depending on the harmonic. Afterwards, the signal was
split into V overlapping frames. Each frame of size L was shifted by S sec from
its immediate predecessor frame and was multiplied by an L-size temporal
window. Here, the best results for each harmonic in Data 1 and Data 2 were derived
employing an L-size Parzen window and an L-size rectangular window,
respectively. Frame lengths for Data 1 and Data 2 are indicated in Table 2.1. The
choice of temporal window is of crucial importance in ENF estimation. Extensive
experiments were conducted in order to determine the best window choice.
Next, the power spectrum is estimated for each frame and an approximate
frequency $\omega_{\xi_{\max}}$ associated with the maximum power spectrum magnitude is
extracted. In order to derive a more precise frequency estimate, quadratic
interpolation is employed. Thus, a quadratic model is fit to the logarithm of the
estimated power spectrum [26].

Table 2.1: Frame parameters (in sec)

Parameters         Data 1   Data 2
Time shift, S      1        1
Frame length, L    20       33
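
The frame-based procedure described above (framing, windowing, peak picking, and quadratic interpolation on the logarithm of the power spectrum) can be sketched as follows. For simplicity the sketch uses a zero-padded periodogram as the per-frame spectral estimate, which is an assumption of this example; the function names and the `pad_factor` parameter are hypothetical.

```python
import numpy as np


def refine_peak(log_psd, k):
    """Vertex of the parabola through the log-spectrum at bins k-1, k, k+1."""
    a, b, c = log_psd[k - 1], log_psd[k], log_psd[k + 1]
    return k + 0.5 * (a - c) / (a - 2.0 * b + c)


def estimate_enf(x, fs, frame_sec, shift_sec=1.0, window=None, pad_factor=4):
    """One frequency estimate (Hz) per frame of length frame_sec, shifted by shift_sec."""
    N = int(round(frame_sec * fs))
    S = int(round(shift_sec * fs))
    w = np.ones(N) if window is None else np.asarray(window, dtype=float)
    nfft = pad_factor * N                      # dense frequency grid
    estimates = []
    for start in range(0, len(x) - N + 1, S):
        frame = x[start:start + N] * w
        psd = np.abs(np.fft.rfft(frame, nfft)) ** 2
        k = int(np.argmax(psd[1:-1])) + 1      # coarse peak, excluding the end bins
        k_fine = refine_peak(np.log(psd + 1e-300), k)
        estimates.append(k_fine * fs / nfft)
    return np.array(estimates)
```

With a 30-second signal sampled at 441 Hz, `frame_sec=20`, and `shift_sec=1`, the function returns 11 estimates, one per frame position.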


2.5 Methods 


2.5.1 Blackman-Tukey 


Let $\mathbf{y}_i = [y(i), \ldots, y(i+N-1)]^T$ be the $i$-th data segment containing the $N = 8820$
and $N = 14553$ raw samples for $L = 20$ and $L = 33$, respectively, after
bandpass filtering, and let $(\cdot)^T$ denote transposition. The autocorrelation between
real-valued $y(t)$ and $y(t-n)$ is defined as $r(n) = E\{y(t)\, y(t-n)\}$, where $E$
denotes the expectation operator.

The standard biased estimate of $r(n)$ is given by [64]:

$$\hat{r}(n) = \frac{1}{N} \sum_{t=n+1}^{N} y(t)\, y(t-n), \quad 0 \le n \le N-1 \tag{2.4}$$

For negative lags, $\hat{r}(-n) = \hat{r}(n)$. Classic periodogram-based methods suffer
from high variability due to the accumulation of estimation errors. The
Blackman-Tukey method yields a refined periodogram [64]:

$$\hat{\phi}_{BT}(\omega) = \sum_{\ell=-(M-1)}^{M-1} w(\ell)\, \hat{r}(\ell)\, e^{-j\omega\ell} \tag{2.5}$$

where $w(\ell)$ is an even lag-window function. Here, $M$ will be defined
subsequently. For each data segment, $N = L F_s$, where $F_s$ is the sampling frequency.
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For increased resolution, dense frequency samples are employed as we = au: , 
€=0,1,..., 2-1, where = =4N =4LF,. 7 
Proper selection of lag window may lead to an improvement of ENF estimation 
accuracy. Depending on the nature of the application, the best window choice will 
yield the desirable results. For each frame, the Power Spectral Density (PSD) (2.5) 
attains a maximum at € € [0, = — 1] or fe = “@* F,. The computational com- 


al 
= 


1 
plexity of Blackman-Tukey spectral estimator is about = 5 log. (=) + 2 log,(2=) 
[64], which is not prohibiting for large datasets. 
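
A minimal NumPy sketch of the estimator in (2.4) and (2.5) is given below. It is an illustration under stated assumptions rather than the thesis code: the function names, the Bartlett stand-in lag window, the 400 Hz sampling frequency, and the 50 Hz test tone are hypothetical, and the double-sided sum is folded into a cosine sum, which is valid only for real-valued data and an even lag window.

```python
import numpy as np


def biased_autocorrelation(y, max_lag):
    """Standard biased estimate r_hat(n) = (1/N) * sum_t y(t) y(t-n), n = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    n_samples = y.size
    return np.array([np.dot(y[n:], y[:n_samples - n]) / n_samples
                     for n in range(max_lag + 1)])


def blackman_tukey_psd(y, lag_window, n_freq):
    """Blackman-Tukey estimate on the grid omega_k = 2*pi*k/n_freq.

    lag_window holds w(0), ..., w(M-1). The window is assumed even, so the
    negative lags are folded into a cosine sum (valid for real-valued y).
    """
    w = np.asarray(lag_window, dtype=float)
    r = biased_autocorrelation(y, w.size - 1)
    omega = 2.0 * np.pi * np.arange(n_freq) / n_freq
    lags = np.arange(1, w.size)
    psd = w[0] * r[0] + 2.0 * np.cos(np.outer(omega, lags)) @ (w[1:] * r[1:])
    return omega, psd


# Toy usage: a 50 Hz tone sampled at a hypothetical 400 Hz for 2 s.
fs, n_samples, f0 = 400.0, 800, 50.0
t = np.arange(n_samples) / fs
y = np.sin(2.0 * np.pi * f0 * t)
M = 200
bartlett = 1.0 - np.arange(M) / M          # stand-in lag window, not the LWD
omega, psd = blackman_tukey_psd(y, bartlett, n_freq=4 * n_samples)
k_max = int(np.argmax(psd[: omega.size // 2]))
f_peak = omega[k_max] / (2.0 * np.pi) * fs  # approximate peak frequency in Hz
```

The dense cosine matrix keeps the sketch short; for frame lengths such as those in Table 2.1 an FFT-based evaluation would be the more economical choice.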


2.5.2 Lag window design 


Taking into account the noisy nature of the ENF signal, a lag window is designed,
coined as LWD, in order to reduce the interference that occurs in ENF signals and
increase the ENF estimation accuracy [65]. The lag window will be used in the BT
spectral estimator. Leakage reduction is the main objective of lag window design.
The tradeoff between leakage and smearing should be taken into consideration for
the selection of the window shape. In ENF applications, especially when it comes to
speech recordings, strong interference may mask the ENF signal through leakage.
To this end, the design parameter $\theta$ employed herein accepts some smearing in
exchange for reducing leakage as much as possible.
Let $\mathbf{d} = [d(0), \ldots, d(M-1)]^T$ be a discrete impulse response and

$$\mathbf{a}(\omega) = [1, e^{-i\omega}, \ldots, e^{-i(M-1)\omega}]^T \tag{2.6}$$

be the vector of the Discrete-time Fourier Transform (DTFT). The DTFT of $\mathbf{d}$ can be
written as:

$$D(\omega) = \mathbf{d}^* \mathbf{a}(\omega) \tag{2.7}$$

where $(\cdot)^*$ stands for the Hermitian transposition of complex-valued vectors,
following the notation in [64]. The spectral window is derived as [64]:

$$W(\omega) = |D(\omega)|^2 \tag{2.8}$$

where $D(\omega)$ is the frequency response of any window.
The corresponding positive semi-definite lag window is given by the auto-correlation
of $\mathbf{d}$, i.e., for $-(M-1) \le \ell \le (M-1)$

$$w(\ell) = \sum_{k=0}^{M-1} d(k)\, d^*(k-\ell) \tag{2.9}$$

42 


Chapter 2. BT spectral estimation and ENF matching from power mains and 
speech 


where $^*$ for scalars stands for conjugation.

Let $W(\omega)$ be the frequency response of the window function given by (2.9).
The following maximization problem should be solved in order to maximize the
relative energy in the main lobe of window $W(\omega)$ [64]:

$$\max_{\mathbf{d}} \left\{ \frac{\int_{-\theta\pi}^{\theta\pi} W(\omega)\, d\omega}{\int_{-\pi}^{\pi} W(\omega)\, d\omega} \right\} \tag{2.10}$$

For accurate ENF estimation, the choice of the window design parameter $\theta$
should provide the best tradeoff between leakage and spectral resolution. By
increasing $\theta$, leakage will be mitigated at the expense of reduced spectral resolution.
Here, we employ $\theta = 3.5/M$ and $\theta = 2.1/M$ for Data 1 and Data 2, respectively.
Taking into account that

$$\frac{1}{2\pi}\int_{-\theta\pi}^{\theta\pi} |D(\omega)|^2\, d\omega = \mathbf{d}^* \left[ \frac{1}{2\pi}\int_{-\theta\pi}^{\theta\pi} \mathbf{a}(\omega)\,\mathbf{a}^*(\omega)\, d\omega \right] \mathbf{d}, \qquad \frac{1}{2\pi}\int_{-\pi}^{\pi} |D(\omega)|^2\, d\omega = \mathbf{d}^*\mathbf{d} \tag{2.11}$$

the maximization problem (2.10) can be rewritten as the maximization of the Rayleigh
quotient:

$$\max_{\mathbf{d}} \frac{\mathbf{d}^*\mathbf{W}\mathbf{d}}{\mathbf{d}^*\mathbf{d}} \tag{2.12}$$

where $\mathbf{W}$ is the Toeplitz matrix

$$\mathbf{W} = \frac{1}{2\pi}\int_{-\theta\pi}^{\theta\pi} \mathbf{a}(\omega)\,\mathbf{a}^*(\omega)\, d\omega = [W_{m-n}] \tag{2.13}$$

with elements

$$W_{m-n} = \frac{1}{2\pi}\int_{-\theta\pi}^{\theta\pi} e^{i(m-n)\omega}\, d\omega = \theta\, \operatorname{sinc}[(m-n)\theta\pi]. \tag{2.14}$$

In (2.14), the sinc function is defined as $\operatorname{sinc}(x) \triangleq \frac{\sin x}{x}$. Thus, $\mathbf{W}$ is a real
symmetric Toeplitz matrix. Accordingly, its eigenvectors have real entries.

The optimal lag window, which maximizes the relative energy in the main lobe
of $W(\omega)$, is obtained when $\mathbf{d}$ is chosen as the principal eigenvector of $\mathbf{W}$ [64].
Here, $\mathbf{d}$ is a real vector. The best results of each harmonic w.r.t. MCC and mSDE
are derived employing $M = 0.25LF_s$ and $M = 0.5LF_s$ for Data 1 and Data 2,
respectively. The window function $w(\ell)$ and its corresponding frequency response
are depicted in Figs. 2.1 and 2.2, respectively.
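
The design procedure of this subsection (sinc Toeplitz matrix, principal eigenvector, autocorrelation of the eigenvector) can be sketched in NumPy as follows. The function name `design_lag_window`, the short window length M = 64, and the use of a dense symmetric eigensolver are assumptions made for illustration. Only the value 3.5/M of the design parameter comes from the text.

```python
import numpy as np


def design_lag_window(M, theta):
    """Lag window whose spectral window concentrates energy in |omega| <= theta*pi.

    Returns (w, d, lam): w holds w(-(M-1)), ..., w(M-1); d is the principal
    eigenvector of the sinc Toeplitz matrix; lam is the attained energy ratio.
    """
    idx = np.arange(M)
    diff = idx[:, None] - idx[None, :]
    # np.sinc(x) = sin(pi*x)/(pi*x), so theta*np.sinc(theta*k) equals
    # theta*sinc(k*theta*pi) with the unnormalized sinc(x) = sin(x)/x.
    W = theta * np.sinc(theta * diff)
    eigvals, eigvecs = np.linalg.eigh(W)     # eigenvalues in ascending order
    d = eigvecs[:, -1]                       # principal eigenvector, unit norm
    w = np.correlate(d, d, mode="full")      # autocorrelation of d
    return w, d, eigvals[-1]


# Toy usage with a short window; theta = 3.5/M mirrors the Data 1 setting.
M = 64
w, d, lam = design_lag_window(M, theta=3.5 / M)
```

Because `d` has unit norm, the returned window satisfies w(0) = 1, and the largest eigenvalue `lam` equals the fraction of the spectral-window energy that falls inside the main lobe.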


Figure 2.1: Window function $w(\ell)$ in the time domain for $M = 0.5LF_s$ (Data 2). Horizontal axis: Samples; vertical axis: Amplitude.


2.5.3 Quadratic interpolation 


In order to calculate a more precise estimate of the ENF, quadratic interpolation (QI)
is employed. Thus, a quadratic model is fit to the logarithm of the estimated
power spectrum about $\omega_{\epsilon_{\max}}$ [63]. QI enables high-resolution ENF estimation
in combination with low time requirements. The frequency sample $\omega_{\epsilon_{\max}}$, which
corresponds to the maximum value of the spectral magnitude, is extracted as an
approximate ENF estimate for each frame. The procedure is briefly described, as
proposed in [39].

Figure 2.2: Frequency response of spectral window $W(\omega)$ for $M = 0.5LF_s$ (Data 2).

Let $\lambda_\alpha = \log \hat{\phi}_{BT}(\omega_{\epsilon_{\max}+\alpha})$, $\alpha = -1, 0, 1$, for each frame:

• Select the bin of the maximum power spectrum, $\lambda_0 = \log \hat{\phi}_{BT}(\omega_{\epsilon_{\max}})$.

• Select the two adjacent bins of $\lambda_0$, i.e., $\lambda_{-1} = \log \hat{\phi}_{BT}(\omega_{\epsilon_{\max}-1})$ and
$\lambda_{1} = \log \hat{\phi}_{BT}(\omega_{\epsilon_{\max}+1})$.

• Calculate the quadratic peak $\delta$, which corresponds to an improved ENF
estimate.

The improved ENF estimate is obtained as $\omega = \omega_{\epsilon_{\max}} + \delta$, where

$$\delta = \frac{\lambda_{-1} - \lambda_{1}}{2\lambda_{-1} - 4\lambda_{0} + 2\lambda_{1}}\,(\omega_{\epsilon_{\max}+1} - \omega_{\epsilon_{\max}}) \tag{2.15}$$
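
The three steps above and the correction in (2.15) amount to locating the vertex of the parabola through three log-spectrum samples. The sketch below illustrates this with the standard library only. The function names and the synthetic parabolic log-spectrum are hypothetical, and the handling of a peak at the edge of the grid is an assumption added here.

```python
def quadratic_peak_offset(lam_m1, lam_0, lam_p1):
    """Vertex offset, in bins, of the parabola through the three log-spectrum
    values at bins -1, 0, +1 around the maximum."""
    denom = 2.0 * lam_m1 - 4.0 * lam_0 + 2.0 * lam_p1
    if denom == 0.0:                 # flat neighbourhood: keep the coarse estimate
        return 0.0
    return (lam_m1 - lam_p1) / denom


def refine_peak(log_psd, freqs):
    """Coarse peak search followed by quadratic interpolation."""
    k = max(range(len(log_psd)), key=log_psd.__getitem__)
    if k == 0 or k == len(log_psd) - 1:
        return freqs[k]              # no neighbour on one side
    offset = quadratic_peak_offset(log_psd[k - 1], log_psd[k], log_psd[k + 1])
    return freqs[k] + offset * (freqs[k + 1] - freqs[k])


# Toy usage: a log-spectrum that is exactly parabolic around 60.013 Hz,
# sampled on a 0.01 Hz grid.
freqs = [59.9 + 0.01 * i for i in range(21)]
log_psd = [-(f - 60.013) ** 2 for f in freqs]
enf_estimate = refine_peak(log_psd, freqs)
```

For an exactly parabolic log-spectrum the interpolation recovers the true peak up to rounding; for real spectra it is an approximation whose quality depends on the window and the grid density.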

In Sec. 2.6, pairwise differences between the MCC delivered by the proposed
approach and those of state-of-the-art ENF estimation approaches were calculated in
order to assess whether they are statistically significant. Fisher's transformation was
employed for this purpose. A similar procedure was followed regarding the
statistical significance of the differences in mSDE. Dynamic time warping
can also be used as an alternative for matching the extracted ENF to the ground
truth [31].


2.6 Experimental evaluation and statistical tests 


The approach described in Sec. 2.5 was implemented on the two datasets, Data 1 and
Data 2, described earlier in this chapter. The ENF was estimated every second for the
total duration of 30 minutes in each dataset utilizing the parameters outlined in
Table 2.1. Regarding Data 1, the 1st, the 2nd, and the 3rd harmonics were used for
ENF estimation. Regarding Data 2, only the 2nd harmonic was exploited, because the
other two harmonics were too weak. This was due to the strong interference present
in speech recordings. To assess the effectiveness of the proposed approach,
comprehensive comparisons were conducted against state-of-the-art methods, which
were applied to the same datasets under identical experimental conditions.


2.6.1 Data 1


The high SNR nature of Data 1 enables accurate ENF estimation for all three
harmonics. For the 1st harmonic of ENF, the proposed BT+LWD, which employs
the designed lag window, resulted in an MCC of 0.9990, outperforming the Welch
method, which resulted in an MCC of 0.9983 in [81]. There, the MCC between
the ENF signal estimated by the BT method and the ground truth ENF was
measured to be 0.9924. The lag window employed in the aforementioned
BT estimate was the rectangular one. An MCC of 0.9900 was reported
for the Time-Recursive Iterative Adaptive Approach with frequency tracking, i.e.,
TRIAA (Track), while TRIAA without Track reached 0.9895. The same figure of
merit was 0.9917 for the Short-time Fourier Transform (STFT) (Track)
implementation [26]. The fast version of the Capon estimator, which combines the
quality of the power spectral estimate delivered by the Capon method with low
computational time requirements, achieves an MCC of 0.9922 [35]. The proposed
BT+LWD outperformed all state-of-the-art approaches employing a value of
$M = 0.25LF_s$ and $\theta = 3.5/M$ for the designed lag window. This approach
constitutes a reliable tool for ENF estimation, because of its high accuracy and
low time requirements. Specifically, it outperforms the state-of-the-art TRIAA
(Track) both in terms of correlation coefficient and computational complexity.
Generally, periodogram-based methods perform better in the presence of high SNR.

Regarding the 2nd harmonic, there seemed to be more interference than at the
1st one. The proposed BT+LWD resulted in an MCC of 0.9930. BT+LWD
outperforms the TRIAA method, which resulted in an MCC of 0.9902. Fast Capon
also performed well and achieved an MCC of 0.9913. The proposed approach
outperforms the Welch and BT methods, both of which resulted in an MCC of 0.9850.

The 3rd harmonic is the most prominent one. A large number of results disclosed
in the literature for ENF estimation refer exclusively to the 3rd harmonic. The
proposed BT+LWD achieved an MCC of 0.9990 for $M = 0.25LF_s$ and $\theta = 3.5/M$,
outperforming all its competitors in ENF estimation at the 3rd harmonic. The ML
approach [27] yielded an MCC of 0.9977, while the conventional BT approach
delivered an MCC of 0.9978 in [31]. The MCC was 0.9961 for TRIAA (Track),
lagging behind STFT (Track). When employing the linear prediction approach [80],
the MCC was measured at 0.9982, while its enhanced denoising version, called
linear prediction with robust principal component analysis (RPCA), reached an
MCC of 0.9984, the top performance at the time it was published. Furthermore,
remarkable results were obtained by fast Capon, yielding an MCC of 0.9990.

The proposed approach improved ENF estimation, outperforming its competitors,
and was established as a reliable solution for all harmonics. At this point, it is
worth mentioning that ENF estimation in a single harmonic is considered to be
sufficient in real-world applications. Many works in the literature estimate the ENF
employing one harmonic only. Detailed results for all harmonics and approaches
are shown in Table 2.2.

It should be assessed whether the differences between the proposed approach and
all other methods are statistically significant. To this end, hypothesis testing
was applied, employing Fisher's transformation. The null hypothesis, $H_0: c_1 = c_2$,
indicates that the MCCs are equal and the alternative one, $H_1: c_1 \neq c_2$, indicates the
opposite. For each pair of approaches, the associated MCCs undergo Fisher's z
transformation:

$$z = \frac{1}{2} \ln\left(\frac{1+c}{1-c}\right) \tag{2.16}$$

The number of samples is K = 1800. The test statistic is given by:

$$q_F = \sqrt{K-3}\,(z_1 - z_2) \tag{2.17}$$
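
As a worked illustration of (2.16) and (2.17), the snippet below evaluates the test statistic for one pair of MCC values quoted in this section (BT+LWD and Welch at the 1st harmonic). The function names are hypothetical, and the statistic is coded exactly as written in (2.17); it is a sketch of the arithmetic, not a reproduction of the tests run in the thesis.

```python
import math


def fisher_z(c):
    """Fisher's z transformation of a correlation coefficient, as in (2.16)."""
    return 0.5 * math.log((1.0 + c) / (1.0 - c))


def fisher_test_statistic(c1, c2, n_samples):
    """Test statistic q_F = sqrt(K - 3) * (z1 - z2), as written in (2.17)."""
    return math.sqrt(n_samples - 3) * (fisher_z(c1) - fisher_z(c2))


# Example with the 1st-harmonic MCCs of BT+LWD (0.9990) and Welch (0.9983).
K = 1800
q_f = fisher_test_statistic(0.9990, 0.9983, K)
reject_at_1_percent = abs(q_f) > 2.58   # acceptance region -2.58 < q_F < 2.58
```

Even a difference in the fourth decimal place of the MCC maps to a large statistic here, because Fisher's transformation stretches values close to 1 and K is large.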


Comparisons were conducted between the proposed approach at each harmonic
and the rest of the approaches w.r.t. MCC, as reported in Table 2.2. In all cases, $q_F$
was calculated and found to be outside the region of acceptance for significance
level 5%. Consequently, there is sufficient evidence to warrant the rejection of the
null hypothesis at significance level 5%. Moreover, hypothesis testing at
significance level 1% was conducted. In all cases, $q_F$ was calculated and found to be
outside the region of acceptance $-2.58 < q_F < 2.58$. Consequently, there is
sufficient evidence to warrant the rejection of the null hypothesis at significance level
1%, too. Therefore, the differences between the MCCs are statistically significant. It
is worth mentioning that the proposed approach demonstrates statistically
significant improvements compared to the conventional BT method. Moreover, an L-size
Parzen window was employed as a temporal window prior to spectral estimation,
as explained earlier.


Table 2.2: Maximum correlation coefficient for various approaches applied to
Data 1.

Approach                     60 Hz     120 Hz    180 Hz
Linear Prediction            -         -         0.9982
Linear Prediction (RPCA)     -         -         0.9984
ML                           -         -         0.9977
Welch [31]                   0.9983    0.9850    0.9983
BT                           0.9924    0.9850    0.9978
BT+LWD (Proposed)            0.9990    0.9930    0.9990
TRIAA                        0.9895    0.9902    0.9961
TRIAA (Track)                0.9900    0.9946    0.9961
STFT [26]                    0.9912    0.9911    0.9968
STFT (Track) [26]            0.9917    0.9949    0.9968
Fast Capon                   0.9922    0.9913    0.9990


Table 2.3: Minimum standard deviation of error for approaches applied to Data 1.

Approach               60 Hz            120 Hz           180 Hz
STFT [26]              2.772 · 10^-3    2.774 · 10^-3    1.900 · 10^-3
STFT (Track)           2.650 · 10^-3    2.145 · 10^-3    1.851 · 10^-3
ML                     -                -                0.760 · 10^-3
BT                     2.284 · 10^-3    2.202 · 10^-3    1.218 · 10^-3
BT+LWD (Proposed)      0.838 · 10^-3    1.746 · 10^-3    0.832 · 10^-3
TRIAA [26]             3.032 · 10^-3    2.822 · 10^-3    1.999 · 10^-3
TRIAA (Track)          2.919 · 10^-3    2.198 · 10^-3    1.999 · 10^-3


The second metric employed is the mSDE. The empirical findings are summarized
in Table 2.3. A limited number of papers employ this metric. It is worth
mentioning that there are approaches which, although they yield an MCC exceeding
0.99, have a relatively high mSDE. For this reason, experiments employing mSDE are
conducted systematically to point out the efficacy of the approach under
examination.

As stated previously, better results are obtained when the 3rd harmonic is used.
The mSDE of BT+LWD at the 1st harmonic was measured at $0.838 \cdot 10^{-3}$, outperforming
STFT (Track) and TRIAA (Track), which yielded an mSDE of $2.650 \cdot 10^{-3}$ and
$2.919 \cdot 10^{-3}$, respectively. Moreover, the proposed BT+LWD outperforms BT,
which resulted in an mSDE of $2.284 \cdot 10^{-3}$. TRIAA reached an mSDE of
$3.032 \cdot 10^{-3}$, while conventional STFT attained $2.772 \cdot 10^{-3}$. The proposed BT+LWD
employed $M = 0.25LF_s$ and $\theta = 3.5/M$.

Regarding the 2nd harmonic, the proposed BT+LWD achieved an mSDE of
$1.746 \cdot 10^{-3}$, outperforming all its competitors. STFT (Track) yielded an mSDE of
$2.145 \cdot 10^{-3}$, while TRIAA (Track) resulted in $2.198 \cdot 10^{-3}$. Results deteriorated
when TRIAA was employed: an mSDE of $2.822 \cdot 10^{-3}$ was reached, lagging behind
STFT, which achieved a value of $2.774 \cdot 10^{-3}$.

The mSDE of BT+LWD at the 3rd harmonic was measured at $0.832 \cdot 10^{-3}$,
outperforming the TRIAA (Track) mSDE, which was measured at $1.999 \cdot 10^{-3}$ [26]. The
conventional STFT yielded an mSDE of $1.900 \cdot 10^{-3}$ for the 3rd harmonic, while
an increase was noticed at the 1st and 2nd harmonics, reaching $2.772 \cdot 10^{-3}$ and
$2.774 \cdot 10^{-3}$, respectively. The STFT (Track) approach reached a value of $1.851 \cdot 10^{-3}$,
slightly better than the conventional one. BT reached a value of $1.218 \cdot 10^{-3}$,
also lagging behind the proposed BT+LWD. The mSDE of the ML approach reached
a value of $0.760 \cdot 10^{-3}$.

In order to determine whether the differences between the mSDE of the top
performing approaches and the rest of the approaches reported in Table 2.3 are
statistically significant, hypothesis testing was applied [67]. Under the assumption
that the errors are normally distributed, the statistical test for the variances of
errors was based on the F-distribution. The sample variances $s^2$ of errors for each
method were calculated. The null hypothesis $H_0: s_1^2 = s_2^2$ indicates that the
variances of errors are equal, while the alternative hypothesis indicates the
opposite. For significance level 5% and number of samples K = 1800, a two-tailed
test was applied. The test statistic for each pair of approaches is [67]:

$$q = \frac{s_1^2}{s_2^2} \tag{2.18}$$
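
The ratio in (2.18) can be illustrated with two of the 1st-harmonic values quoted in this section (BT+LWD and BT). The sketch below rests on two assumptions that are not confirmed by the text: the reported mSDE values are treated as the standard deviations whose squares form the sample variances, and the acceptance bounds are passed in as plain numbers (here the 5% bounds 0.9116 and 1.0968 quoted in this section) instead of being derived from the F-distribution. The function names are hypothetical.

```python
def variance_ratio(sd_1, sd_2):
    """Test statistic q = s1^2 / s2^2, as in (2.18), from two standard deviations."""
    return (sd_1 ** 2) / (sd_2 ** 2)


def rejects_equal_variances(q, lower, upper):
    """Two-tailed decision: reject H0 when q falls outside (lower, upper)."""
    return not (lower < q < upper)


# Example: BT+LWD (0.838e-3) against BT (2.284e-3) at the 1st harmonic.
q = variance_ratio(0.838e-3, 2.284e-3)
reject_at_5_percent = rejects_equal_variances(q, 0.9116, 1.0968)
```

With these inputs the ratio is far below the lower acceptance bound, which is consistent with the rejection of the null hypothesis reported for this pair.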


Comparisons were conducted between the top performing approach at each
harmonic and the rest of the approaches w.r.t. the mSDE, as reported in Table
2.3. For all pairs of comparisons, $q$ was found to be outside the region of acceptance
$0.9116 < q < 1.0968$. Thus, there is sufficient evidence to warrant the rejection of
the null hypothesis at significance level 5%. Accordingly, the differences between
the variances of errors are statistically significant. Furthermore, hypothesis testing
at significance level 1% was performed. In all cases, $q$ was calculated and found
to be outside the region of acceptance $0.89 < q < 1.18$. Consequently, there is
sufficient evidence to warrant the rejection of the null hypothesis at significance
level 1% for all comparisons and to state that the differences between the variances
of errors are statistically significant. The proposed BT+LWD outperforms
TRIAA (Track) in terms of mSDE and demonstrates statistically significant
improvements compared to the conventional BT method. Several state-of-the-art
methods did not provide results regarding mSDE. To conclude, BT+LWD was found
to be the top performing approach for ENF estimation in the 1st and 2nd harmonics
and the second best approach for ENF estimation in the 3rd harmonic w.r.t. mSDE.

Figure 2.3: Extracted ENF signal using BT and BT+LWD employing the proposed
lag window to Data 2.


2.6.2 Data 2


An example of the extracted ENF signal using the conventional BT and BT employing
the proposed designed lag window, i.e., BT+LWD, is shown in Fig. 2.3. The
reference ground truth signal is also overlaid in Fig. 2.3. The ENF signals are shifted
by 0.05 Hz up or down from their actual values for illustration purposes in Fig. 2.3.
As can be noticed, around 700 sec, the proposed BT+LWD provides a more
accurate extraction of the ENF compared to the conventional BT. The same is
indicated around 1100, 1250, and 1650 sec. These improvements can also be detected
by visual inspection of the absolute errors between each approach and the reference
ground truth. These differences tend to zero when BT+LWD is employed, as
can be seen in Fig. 2.4.

Figure 2.4: Absolute errors between the reference ground truth and the extracted
ENF for Data 2, employing a) BT and b) BT+LWD.


Regarding Data 2, only the 2nd harmonic was exploited, as done in the literature,
because the other two harmonics were too weak. Data 2 is more challenging
than Data 1, being closer to real-world practical forensic applications. The
difficulty in ENF estimation lies in the fact that the pitch frequency of a male voice
(135 Hz for vowel /a/) or a female voice (200 Hz for vowel /a/) in Data 2 interferes
with the ENF harmonics. The variability of pitch is large for both females and males
and, thus, has a serious impact on ENF estimation in Data 2. Specifically, the
range of pitch frequencies is approximately 120-350 Hz and 100-200 Hz for
females and males, respectively [68]. The proposed BT+LWD delivered an MCC
of 0.9434, outperforming the recent state-of-the-art approach in ENF estimation,
i.e., linear prediction [80], which yields an MCC of 0.9366. Moreover, the
proposed approach also outperforms TRIAA, which yields an MCC of 0.9305.
The conventional STFT, which is often adopted in the literature, resulted in an
MCC of 0.9125. The fast implementation of the Capon method [85] yielded an MCC of
0.9351. The ML approach [27] reached an MCC of 0.9319, outperforming the Welch
method that reached 0.9179. It is worth pointing out that the best performing
approaches in Data 1 turn out to be fragile in Data 2. The MCC of the BT estimator
was measured at 0.9179, lagging behind the proposed BT+LWD. Linear prediction
(RPCA) and TRIAA (Track) perform better than the proposed BT+LWD,
but there is a key difference that should be noted. The approaches that outperform
the proposed BT+LWD include an extra module, namely, RPCA or Track. As a
result, the proposed method facilitates generalization and can be employed in a
variety of applications and spectral analysis frameworks. Furthermore, there is a
significant difference between the proposed BT+LWD and TRIAA. The computational
complexity of the proposed BT+LWD is about $\Xi\,[5\log_2(\Xi) + 2\log_2(2\Xi)]$, while the
bottleneck of the TRIAA approach lies in its high complexity of about $\mathcal{O}(N^2\Xi)$.
Employing $\Xi = 4N$, where N is the segment length,
$\Xi\,[5\log_2(\Xi) + 2\log_2(2\Xi)] = 4N\,[5\log_2(4N) + 2\log_2(8N)] \approx 28N\log_2 N$.
In the same manner, the TRIAA computational complexity results in $\mathcal{O}(4N^3)$.

In order to design an effective lag window able to provide optimal results in
terms of leakage effects, two critical factors should be taken into consideration.
Both factors crucially affect the outcome of the spectral analysis. The first factor
is the lag window length in samples, M, which should be chosen carefully
in order to achieve an optimal tradeoff between statistical variance and spectral
resolution. The second factor is the parameter $\theta$. The proper choice of $\theta$ should
offer an optimal tradeoff between the energy in the main lobe and the energy in the
sidelobes. There is no generally applicable rule of thumb. It is worth mentioning
that the design parameter $\theta$ should be larger than 1/M. Otherwise, spectral
window design will deteriorate the results and leakage reduction will be hindered.
We conducted a systematic study to choose the optimal values for both the parameter
$\theta$ and the lag window length M, as shown in Fig. 2.5. The curves depict the MCC
for various values of the aforementioned parameters. The optimal value of MCC for
the challenging Data 2 is obtained for $\theta = 2.1/M$ and $M = 0.5LF_s$. MCCs for
various approaches applied to the 2nd harmonic of Data 2 are summarized in Table 2.4.

As discussed in Sec. 2.6.1, in order to assess whether MCC differences for any
pair of approaches are statistically significant, Fisher's transformation was applied.
In all cases, $q_F$ was calculated and found to be outside the region of acceptance
for significance level 5%. Consequently, there is sufficient evidence to warrant the
rejection of the null hypothesis at significance level 5%. Furthermore, hypothesis
testing at significance level 1% was conducted. In all cases, $q_F$ was calculated
and found to be outside the region of acceptance. Consequently, there is sufficient
evidence to warrant the rejection of the null hypothesis at significance level 1%,
too. Accordingly, BT+LWD is ranked as the fourth best approach, behind approaches
employing an additional module, such as TRIAA (Track), STFT (Track), and
linear prediction (RPCA). BT+LWD delivers a statistically significant difference
against linear prediction, fast Capon, ML, TRIAA, BT, and Welch.


Figure 2.5: Correlation coefficient of the proposed BT+LWD approach for various
values of the design parameter $\theta$ and lag window length M in Data 2. Horizontal
axis: parameter $\theta$ (0.4/M to 2.2/M); vertical axis: correlation coefficient.


Table 2.4: Maximum correlation coefficient for various approaches applied to
Data 2.

Approach                         120 Hz
Linear Prediction [30]           0.9366
Linear Prediction (RPCA) [30]    0.9764
ML                               0.9319
Welch [31]                       0.9179
BT                               0.9179
BT+LWD (Proposed)                0.9434
TRIAA [26]                       0.9305
TRIAA (Track)                    0.9907
STFT                             0.9125
STFT (Track)                     0.9857
Fast Capon                       0.9351


Regarding the mSDE, the proposed BT+LWD delivered an mSDE of
$6.287 \cdot 10^{-3}$, outperforming TRIAA, which yielded a value of $7.225 \cdot 10^{-3}$. The
conventional STFT yielded a value of $7.948 \cdot 10^{-3}$. BT spectral estimation delivers
accurate results, reaching an mSDE of $7.052 \cdot 10^{-3}$. Moreover, the Welch method
achieved an mSDE of $7.313 \cdot 10^{-3}$. Generally, strong interference in this dataset
hinders accurate ENF estimation. TRIAA (Track), which is based on dynamic
programming, reached an mSDE of $2.914 \cdot 10^{-3}$ and the ML approach yielded a value
of $3.839 \cdot 10^{-3}$ [27]. Moreover, STFT (Track) achieved an mSDE of $3.369 \cdot 10^{-3}$.
Results of the linear prediction approach regarding mSDE were not disclosed in [80].
The mSDE measured for various approaches applied to Data 2 is listed in Table 2.5.

Statistical tests were conducted to assess whether mSDE differences for any
pair of approaches are statistically significant at significance levels 5% and 1%.
In both sets of tests, there is sufficient evidence to warrant the rejection of the
null hypothesis at the aforementioned significance levels. Accordingly, BT+LWD
is ranked as the fourth best approach, behind TRIAA (Track), ML, and STFT (Track).
BT+LWD delivers a statistically significant mSDE difference against BT, TRIAA,
Welch, and STFT.


Table 2.5: Minimum standard deviation of error for approaches applied to Data 2.

Approach               120 Hz
STFT                   7.948 · 10^-3
STFT (Track)           3.369 · 10^-3
ML                     3.839 · 10^-3
Welch                  7.313 · 10^-3
BT                     7.052 · 10^-3
BT+LWD (Proposed)      6.287 · 10^-3
TRIAA                  7.225 · 10^-3
TRIAA (Track)          2.914 · 10^-3


2.7 Assessment of spectral estimation methods 


Henceforth, we assess case studies of ENF estimation, resorting to either
non-parametric spectral estimation methods (e.g., periodogram and refined periodogram
methods, such as Blackman-Tukey, Welch, and Daniell, the Capon spectral estimator,
and IAA) or parametric ones (e.g., ESPRIT and Multiple Signal Classification [MUSIC]).
Fast algorithms for the Capon and IAA spectral estimation methods are included
as well by exploiting the Gohberg-Semencul factorization of the inverse covariance
matrix [70], [71]. All methods are applied to consecutive frames of data recorded
from the power mains as well as to the audio recording described
previously. Motivated by Professor Petre Stoica's "Spectral estimation is an art",
here we put emphasis on the details of band-pass (BP) filtering of the raw signal prior
to spectral analysis and the fine tuning of the parameters involved in spectral analysis
techniques, which enable us to report more accurate results than those disclosed
in [26]. Besides the fundamental frequency, ENF extraction is carried out in
its higher harmonics, which demonstrate a higher SNR than the fundamental one,
yielding better results, as also observed in [58]. In addition to existing matching
procedures between the extracted ENF time series and the ground truth one of
equal length, efficient DTW is employed and assessed, which allows the
aforementioned time series to have different lengths and eliminates the need to downsample
the ground truth ENF time series, as is tacitly assumed in [26]. When long
stationary segments of the extracted ENF time series (e.g., having a duration of 20 sec
or so) are used, a very high matching accuracy is achieved. Such long segments are not
always available in practice. Accordingly, there is a need to assess ENF extraction
methods from short utterances (see Sec. 2.7.1). To do so, an alteration in an audio
recording was devised by replacing a short utterance recorded in Europe by the
same utterance recorded in the US, with the latter being perceptually
indistinguishable from the former, in order to assess the limits of spectral estimation methods
for ENF extraction. Simple non-parametric spectral estimation techniques, such
as the periodogram and the Daniell method, are shown to be able to detect the
aforementioned alteration in the audio recording.


2.7.1 Utterance dataset 


A third dataset was created by concatenating the 10 recordings uttered by TIMIT
female speaker ID TEST\DR1\FAKS0 to create an audio signal recorded in the
US. The same 10 utterances were played back from the loudspeakers of a notebook
connected to the power mains at Thessaloniki, Greece and recorded by various
mobile phones during the collection of the MOBIPHONE database [73]. The
European (EU) recording has a duration of 38.5677 sec. The utterance SA2 "Don't ask
me to carry an oily rag like that" in the European recording was replaced by the
same utterance recorded in the US in a perceptually indistinguishable manner.
By doing so, the European recording was altered by inserting a 3.6288 sec long
utterance with a nominal ENF of 60 Hz starting from 5.3125 sec and ending at
8.9413 sec, while the remaining recording has a nominal ENF of 50 Hz. Let us
refer to the third audio signal as the mixed recording. Both the mixed recording
and the European one have the same duration. The sampling frequency of all
recordings is 16 kHz. Downsampling by a factor of 10 is found beneficial for the
US and EU recordings. These recordings were filtered by an FIR BP filter of
order 4801 centered at the second harmonic of ENF with band-pass edges set at
119.95 Hz and 120.05 Hz as well as 99.95 Hz and 100.05 Hz, respectively. In
contrast, the mixed recording was filtered by an FIR BP filter of order 1501 with
band-pass edges set at 49.95 Hz and 60.05 Hz.


2.7.2 Spectral estimation methods for ENF estimation


Assuming stationarity within the frame, the simplest non-parametric method for
estimating the ENF is the STFT. That is, frame by frame, the periodogram of each
frame is computed by squaring the magnitude of the STFT. Let $\hat{\phi}_r(\omega_q)$ be the
periodogram of the $N = LF_s$ samples long $r$th frame, where $\omega_q = \frac{2\pi q}{Q}$, $q = 0, 1, \ldots, Q-1$
are the frequency samples and $F_s$ is the sampling frequency. Typically, $Q > N$,
e.g., $Q = 4N$. The frequency sample $\omega_{q_{\max}}$, which corresponds to the maximum
periodogram value, is extracted as a first ENF estimate. Next, a quadratic
interpolation is employed, which fits a quadratic model to the logarithm of the
estimated power spectrum about $\omega_{q_{\max}}$ [63]. Hereafter, the aforementioned
spectral estimation method is replaced by other non-parametric and parametric
spectral analysis methods.

A refined periodogram method is the Welch method [64]. In this method, each
frame is divided into overlapped segments and each segment is multiplied by a
temporal window. Let $y_j(t)$ denote the $j$th segment. Adjacent segments overlap
by 1000 samples and each segment has a length of M samples, where M is a fraction
of the frame length $N = LF_s$. The Welch estimate of the power spectral density (PSD)
is given by $\hat{\phi}_W(\omega) = \frac{1}{S}\sum_{j=1}^{S} \hat{\phi}_j(\omega)$,
where S is the number of segments and $\hat{\phi}_j(\omega)$ is the windowed periodogram
corresponding to $y_j(t)$. Here, a rectangular window has been employed. The Welch
method yields accurate ENF estimation without being affected by interferences,
especially in the second harmonic of the ENF in both datasets.

The Welch estimator can be related to the Blackman-Tukey (BT) spectral
estimator for suitable choices of the lag window and the auto-covariance estimate
[64]. Accordingly, a natural choice for a refined periodogram is the Blackman-Tukey
estimate given by $\hat{\phi}_{BT}(\omega) = \sum_{\ell=-(M-1)}^{M-1} w(\ell)\,\hat{r}(\ell)\, e^{-i\omega\ell}$, where M is a
fraction of $N = LF_s$ for the first and third harmonic and $M = N = LF_s$ for the second
harmonic in both datasets. Another non-parametric method is the Daniell method [64],
which yields the refined spectral estimate
$\hat{\phi}_D(\omega_q) = \frac{1}{2J+1}\sum_{j=q-J}^{q+J} \hat{\phi}_p(\omega_j)$ for dense
frequency samples $\omega_q = \frac{2\pi q}{Q}$, $q = 0, 1, \ldots, Q-1$. Here, the values $J = 2$ and
$Q = 4N = 4LF_s$ have been used.
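
The Daniell estimate is a moving average of the periodogram over 2J+1 neighbouring frequency samples. A minimal NumPy sketch follows; the function names, the circular treatment of the band edges (motivated by the periodicity of the spectrum), the 400 Hz sampling frequency, and the 50 Hz test tone are assumptions made here for illustration. The values J = 2 and Q = 4N follow the text.

```python
import numpy as np


def periodogram(y, n_freq):
    """Periodogram on the dense grid omega_q = 2*pi*q/n_freq (zero-padded FFT)."""
    y = np.asarray(y, dtype=float)
    return np.abs(np.fft.fft(y, n=n_freq)) ** 2 / y.size


def daniell_psd(y, n_freq, J):
    """Average the periodogram over the 2J+1 bins centred on each frequency sample."""
    phi_p = periodogram(y, n_freq)
    smoothed = np.zeros_like(phi_p)
    for shift in range(-J, J + 1):
        smoothed += np.roll(phi_p, shift)   # circular wrap at the band edges
    return smoothed / (2 * J + 1)


# Toy usage: a 50 Hz tone sampled at a hypothetical 400 Hz for 2 s.
fs, n_samples, f0 = 400.0, 800, 50.0
t = np.arange(n_samples) / fs
y = np.sin(2.0 * np.pi * f0 * t)
Q = 4 * n_samples
phi_d = daniell_psd(y, Q, J=2)
q_max = int(np.argmax(phi_d[: Q // 2]))
f_peak = q_max / Q * fs                     # approximate peak frequency in Hz
```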

The periodogram can be interpreted as a filter bank approach, which uses a
band-pass filter whose impulse response vector is given by the standard Fourier
transform vector $\mathbf{a} = [1, e^{-i\omega}, \ldots, e^{-i(N-1)\omega}]^T$. Let $\hat{\mathbf{R}}$ be an estimate of the
auto-covariance matrix

$$\hat{\mathbf{R}} = \frac{1}{N-m} \sum_{t=m+1}^{N} \begin{bmatrix} y(t) \\ \vdots \\ y(t-m) \end{bmatrix} \begin{bmatrix} y^*(t) & \cdots & y^*(t-m) \end{bmatrix} \tag{2.19}$$

The Capon method is a filter bank approach based on a data-dependent filter [64]:
$\mathbf{h} = \frac{\hat{\mathbf{R}}^{-1}\mathbf{a}(\omega)}{\mathbf{a}^*(\omega)\hat{\mathbf{R}}^{-1}\mathbf{a}(\omega)}$, where $\mathbf{a}(\omega) = [1, e^{-i\omega}, \ldots, e^{-im\omega}]^T$ and $[\cdot]^*$ denotes
conjugate transposition. The Capon spectral estimate is given by:

$$\hat{\phi}_C(\omega) = \frac{m+1}{\mathbf{a}^*(\omega)\,\hat{\mathbf{R}}^{-1}\,\mathbf{a}(\omega)} \tag{2.20}$$

computed for dense frequency samples $\omega_q = \frac{2\pi q}{Q}$, $q = 0, 1, \ldots, Q-1$ with $Q = 300m$
and $m = 10$ for the second and third harmonics of Data 1 and Data 2. The
first harmonic is computed with $m = 2$ and $Q = 5000m$ in each case.
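
A direct, non-accelerated NumPy sketch of the Capon estimate is given below. It follows the textbook form of the estimator in [64] with the numerator m+1 and a sample covariance built from lagged data vectors. The function names, the noisy 50 Hz test tone, the 400 Hz sampling frequency, and the grid size are hypothetical, and the explicit matrix inverse is used only for brevity; it is not the fast Gohberg-Semencul implementation referred to in this chapter.

```python
import numpy as np


def covariance_estimate(y, m):
    """(m+1) x (m+1) sample covariance from vectors [y(t), y(t-1), ..., y(t-m)]."""
    y = np.asarray(y, dtype=complex)
    n = y.size
    Y = np.array([y[t - m:t + 1][::-1] for t in range(m, n)])  # one row per t
    return Y.T @ Y.conj() / (n - m)


def capon_psd(y, m, n_freq):
    """Capon spectral estimate on the grid omega_q = 2*pi*q/n_freq."""
    R_inv = np.linalg.inv(covariance_estimate(y, m))
    omega = 2.0 * np.pi * np.arange(n_freq) / n_freq
    A = np.exp(-1j * np.outer(np.arange(m + 1), omega))   # columns a(omega_q)
    denom = np.einsum("iq,ij,jq->q", A.conj(), R_inv, A).real
    return omega, (m + 1) / denom


# Toy usage: a noisy 50 Hz tone sampled at a hypothetical 400 Hz for 2 s.
rng = np.random.default_rng(0)
fs, n_samples, f0 = 400.0, 800, 50.0
t = np.arange(n_samples) / fs
y = np.sin(2.0 * np.pi * f0 * t) + 0.1 * rng.standard_normal(n_samples)
omega, psd = capon_psd(y, m=10, n_freq=2048)
k_max = int(np.argmax(psd[:1024]))
f_peak = omega[k_max] / (2.0 * np.pi) * fs   # approximate peak frequency in Hz
```

A small amount of noise is added to the toy signal because the sample covariance of a noise-free sinusoid is rank deficient and cannot be inverted.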

The IAA is a non-parametric alternative to the weighted Least Squares method, as
presented in [74]. Let $\mathbf{y}_N = [y(t), \ldots, y(t+N-1)]^T$ be the data vector. Let also
$\mathbf{f}_N(\omega_q) = [1, e^{i\omega_q}, \ldots, e^{i\omega_q(N-1)}]^T$ be the frequency vector, where $q = 0, 1, \ldots, Q-1$
and Q is the number of frequency samples taken as a multiple of N. Assume
$\mathbf{F}_{N,Q} = [\mathbf{f}_N(\omega_0) | \ldots | \mathbf{f}_N(\omega_{Q-1})]$. The sample covariance matrix is given by
$\mathbf{R}_N = \mathbf{F}_{N,Q}\,\mathbf{P}_Q\,\mathbf{F}_{N,Q}^*$, where $\mathbf{P}_Q$ is the diagonal matrix whose diagonal elements are
obtained by the squared magnitude of the following estimate

$$x_q = \frac{\mathbf{f}_N^*(\omega_q)\,\mathbf{R}_N^{-1}\,\mathbf{y}_N}{\mathbf{f}_N^*(\omega_q)\,\mathbf{R}_N^{-1}\,\mathbf{f}_N(\omega_q)} \tag{2.21}$$

at the previous iteration, say $p_q = |x_q|^2$. Both $\mathbf{R}_N$ and $x_q$ are calculated iteratively
until practical convergence. For Data 1 and Data 2, $N = 2F_s$ and $Q = 2N$.
Fast implementations of IAA, referred to as F-IAA, were proposed in [70], [71],
which exploit the Gohberg-Semencul factorization of the inverse covariance matrix,
building on the Hermitian Toeplitz structure of the covariance matrix and resorting
to fast Fourier transforms. Such fast implementations were applied here to reduce
the extremely high computational requirements of IAA. The same approach was
also employed to optimize the Capon spectral estimator in Chapter 3.

ENF estimation can be cast as a line spectrum estimation problem. Accordingly,
one may choose a suitable parametric method for solving this problem, such as
ESPRIT, as was done in [26]. In particular, one has to choose the size $m$ of the
biased $m \times m$ auto-covariance estimate $\hat{R}$ and the number of frequency samples $Q$
to be estimated. Let $I_{m-1}$ denote the identity matrix of size $(m-1) \times (m-1)$. The
frequencies $\{\omega_q\}_{q=1}^{Q}$ are estimated as $-\arg(\hat{\nu}_q)$, where $\{\hat{\nu}_q\}_{q=1}^{Q}$ are the
eigenvalues of the estimated matrix $\hat{\phi}$ [64]:

$$\hat{\phi} = (S_1^* S_1)^{-1} S_1^* S_2 \tag{2.22}$$

$$S_1 = [I_{m-1} | 0] \, S \tag{2.23}$$

$$S_2 = [0 | I_{m-1}] \, S \tag{2.24}$$

and $S$ is the matrix having as columns the $Q$ principal eigenvectors of $\hat{R}$. Here,
$m = 4$ and $Q = 2$.

The MUSIC algorithm [64] is another suitable method for ENF extraction.
First, the biased $m \times m$ auto-covariance estimate $\hat{R}$ is computed. Next, the
so-called “pseudospectrum” is estimated:

$$\hat{P}_{MU}(\omega) = \frac{1}{a^*(\omega) \hat{G} \hat{G}^* a(\omega)} \tag{2.25}$$

$\hat{P}_{MU}$ reveals which sinusoidal components are present in the signal. $\hat{G}$ denotes
the matrix made from the eigenvectors of $\hat{R}$ spanning the noise subspace. The
values $m = 4$ and $Q = 2$ are used for Data 1 and Data 2.

Besides the conventional methods for calculating the accuracy of ENF estimation,
one may employ the standard deviation of the error between the true ENF
and the estimated one. Another method for matching the extracted ENF to the
ground truth is the DTW. Let $d(k,l) = (f_k - g_l)^2$. To align the two time series $f$
and $g$, a warping path is found very efficiently using dynamic programming and
enforcing proper constraints to evaluate the recurrence which defines the cumulative
distance $\gamma(k,l)$, for $k = 2, \ldots, K$:

$$\gamma(k,l) = \begin{cases} d(k,l) + \gamma(k-1,l) & \text{for } l = 1 \\ d(k,l) + \min\{\gamma(k,l-1), \gamma(k-1,l-1), \gamma(k-1,l)\} & \text{for } l = 2, \ldots, K \end{cases} \tag{2.26}$$

assuming that

$$\gamma(1,l) = \begin{cases} d(1,1) & \text{for } l = 1 \\ d(1,l) + \gamma(1,l-1) & \text{for } l = 2, \ldots, K \end{cases} \tag{2.27}$$

Here, the modified version of the original DTW with novel optimizations
introduced in is used.
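
The cumulative distance recurrence can be sketched in plain Python as follows. This is the textbook dynamic programming formulation with the squared-difference local cost; it is not the optimized DTW variant used in the experiments, and the function name `dtw_distance` and the toy ENF values are assumptions made for illustration.

```python
def dtw_distance(f, g):
    """Cumulative DTW distance between sequences f and g with local cost
    d(k, l) = (f[k] - g[l])**2 and steps (k-1, l), (k, l-1), (k-1, l-1)."""
    K, L = len(f), len(g)

    def d(k, l):
        return (f[k] - g[l]) ** 2

    gamma = [[0.0] * L for _ in range(K)]
    gamma[0][0] = d(0, 0)
    for l in range(1, L):                      # first row: horizontal moves only
        gamma[0][l] = d(0, l) + gamma[0][l - 1]
    for k in range(1, K):
        gamma[k][0] = d(k, 0) + gamma[k - 1][0]    # first column: vertical moves only
        for l in range(1, L):
            gamma[k][l] = d(k, l) + min(gamma[k][l - 1],
                                        gamma[k - 1][l - 1],
                                        gamma[k - 1][l])
    return gamma[K - 1][L - 1]

# Toy ENF sequences (assumed values): g repeats the first sample of f once.
f = [60.00, 60.01, 60.02, 60.01]
g = [60.00, 60.00, 60.01, 60.02, 60.01]
dist = dtw_distance(f, g)
```

Because the warping path may repeat samples, a sequence that is merely delayed or stretched with respect to the ground truth still attains a small cumulative distance.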


2.8 Experimental evaluation 


Initially, the ENF extraction methods were evaluated using Data 1 and Data 2
with two different frame lengths, denoted as $L_1$ and $L_2$. The selection of $L_1$
enables direct comparisons with the findings presented in [26], so that the
proposed methods can be assessed in relation to prior research. On the other hand,
the utilization of $L_2$, as specified in Table facilitates an investigation into the
performance of the ENF extraction methods when longer frame lengths are employed.
For the F-IAA implementation, a 2 sec frame length is used due to the high time
requirements that arise from repeated matrix inversions in the iterative process.


Table 2.6: Maximum correlation coefficient for various methods applied to Data 1
with frame length $L_1$.

Algorithm   60 Hz    120 Hz   180 Hz
STFT        0.9886   0.985    0.9957
Welch       0.9983   0.985    0.9983
BT          0.9924   0.985    0.9978
Daniell     0.9906   0.985    0.9977
Capon       0.9969   0.9909   0.9972
F-IAA       0.9571   0.9784   0.964
ESPRIT      0.9979   0.9913   0.9979
MUSIC       0.9979   0.9913   0.9979


For Data 1 using the frame length $L_1$, the agreement between the ENF signal
extracted by the various spectral analysis methods and the ground truth, measured
by frequency disturbance recorders with accuracy up to about 0.0005 Hz, is
summarized in Tables 2.6 and 2.7, as presented in [31]. The maximum correlation
coefficient is listed in Table 2.6, while the minimum standard deviation of error
is gathered in Table 2.7. It is seen that the first and third harmonics are
more accurately estimated than the second one. For the first and third harmonics,
the Welch method yields the best performance w.r.t. both figures of merit.
Compared to [26], the accuracy of the ESPRIT method applied to Data 1 is increased
from 0.947 to 0.9913 for the second harmonic, which is the weakest. Similarly, the
standard deviation of error is reduced from $6.57 \cdot 10^{-3}$ to $2.901 \cdot 10^{-3}$. The most
effective method proposed in is STFT (Track), which uses a discrete dynamic
programming approach. The maximum correlation coefficient of this method
applied to Data 1 is 0.9968 for the third harmonic and the standard deviation of
error is $1.851 \cdot 10^{-3}$. W.r.t. both figures of merit, the methods discussed here
outperform STFT (Track). Using DTW, the minimum cumulative distance between
the aligned time series $f$ and $g$ is obtained when the Welch method is used, as
shown in Table 2.8. Matching the extracted ENF to the ground truth with DTW
is reliable even for the second harmonic, which is the weakest.
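
The two figures of merit used in the tables can be computed as sketched below. This is a minimal illustration under stated assumptions: the correlation is the plain Pearson coefficient at zero lag (any maximization over alignments or parameters is not reproduced), the standard deviation uses the sample estimator with $n - 1$ in the denominator, and the function names and the six-sample toy sequences are hypothetical.

```python
import math

def correlation_coefficient(f, g):
    """Pearson correlation between two equally long ENF sequences."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    cov = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    var_f = sum((a - mf) ** 2 for a in f)
    var_g = sum((b - mg) ** 2 for b in g)
    return cov / math.sqrt(var_f * var_g)

def error_std(f, g):
    """Sample standard deviation of the error f - g."""
    e = [a - b for a, b in zip(f, g)]
    me = sum(e) / len(e)
    return math.sqrt(sum((x - me) ** 2 for x in e) / (len(e) - 1))

# Toy values (assumed): the estimate deviates by +/- 1 mHz from the reference.
reference = [60.000, 60.004, 60.010, 60.006, 59.998, 59.994]
estimate = [60.001, 60.003, 60.011, 60.005, 59.999, 59.993]
rho = correlation_coefficient(estimate, reference)
sigma = error_std(estimate, reference)
```

A high correlation together with a small error deviation indicates that the estimate tracks both the shape and the level of the reference ENF.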

By employing the frame length $L_2$, one expects finer spectral resolution at
the cost of lower time resolution. This is evident in Table 2.9 for Data 1.
Frame-by-frame ENF extraction by employing STFT with frame length $L_2$ followed by
quadratic interpolation yields more accurate results w.r.t. both maximum
correlation coefficient and minimum standard deviation of error than using $L_1$.
Welch and Blackman-Tukey methods yield the best results in the second harmonic,
although periodogram-based methods yield more accurate estimation in the first and
third harmonics. The F-IAA accuracy for the second harmonic is 0.9784, and the
method generally performs accurately although it employs a smaller frame length
(2 sec).


Table 2.7: Minimum standard deviation of error for methods applied to Data 1
with frame length $L_1$.

Algorithm   60 Hz                      120 Hz                     180 Hz
STFT        $2.806 \cdot 10^{-3}$      $3.202 \cdot 10^{-3}$      $1.303 \cdot 10^{-3}$
Welch       $1.069 \cdot 10^{-3}$      $3.202 \cdot 10^{-3}$      $1.069 \cdot 10^{-3}$
BT          $2.284 \cdot 10^{-3}$      $3.202 \cdot 10^{-3}$      $1.218 \cdot 10^{-3}$
Daniell     $2.542 \cdot 10^{-3}$      $3.41 \cdot 10^{-3}$       $1.245 \cdot 10^{-3}$
Capon       $1.445 \cdot 10^{-3}$      $2.659 \cdot 10^{-3}$      $1.395 \cdot 10^{-3}$
F-IAA       $1.491 \cdot 10^{-2}$      $9.263 \cdot 10^{-3}$      $8.523 \cdot 10^{-3}$
ESPRIT      $1.198 \cdot 10^{-3}$      $2.901 \cdot 10^{-3}$      $1.202 \cdot 10^{-3}$
MUSIC       $1.198 \cdot 10^{-3}$      $2.901 \cdot 10^{-3}$      $1.208 \cdot 10^{-3}$


Table 2.8: Cumulative distance between the aligned time series $f$ and $g$ found by
DTW for various methods applied to Data 1 with frame length $L_1$.

Algorithm   60 Hz         120 Hz        180 Hz
STFT        5.63429       6.05339       2.37785
Welch       1.98492       3.02637       1.98434
BT          2.00499       6.7362        1.99956
Daniell     4.98134       6.33208       2.28147
Capon       (illegible)   (illegible)   (illegible)
F-IAA       5.05359       4.11125       5.59216
ESPRIT      2.1139        2.4609        2.10688
MUSIC       2.11392       2.46089       2.11371

Next, we proceed to ENF estimation from Data 2. In the speech recording, the
first and third harmonics of the ENF are too weak [26]. Accordingly, we confine
ourselves to the second harmonic (120 Hz). Table 2.10 summarizes the findings
for the maximum correlation coefficient, as presented in [31]. It is seen that the
Capon method yields the most accurate results. The maximum correlation coefficient
reported here is greater than the 0.8446 reported in for the same length $L_1$. Using
DTW, the minimum cumulative distance between the aligned time series $f$ and
$g$ is obtained when the MUSIC and ESPRIT methods are employed. That is,
DTW succeeds in aligning the extracted ENF time series to the ground truth even
when strong interference exists, as is the case of Data 2. The same performance
ordering of ENF extraction methods is observed when a longer frame length $L_2$
is employed. Periodogram-based methods are ranked top w.r.t. the minimum
standard deviation of error.


Table 2.9: Maximum correlation coefficient for various methods applied to Data 1
with frame length $L_2$.

Algorithm   60 Hz    120 Hz   180 Hz
STFT        0.9916   0.992    0.9964
Welch       0.9964   0.992    0.9965
BT          0.9933   0.992    0.9967
Daniell     0.9926   0.9915   0.9967
Capon       0.9945   0.9913   0.9948
ESPRIT      0.9953   0.9916   0.9953
MUSIC       0.9953   0.9917   0.9953


Table 2.10: Maximum correlation coefficient for various methods applied to Data 2
for both frame lengths.

Algorithm   120 Hz ($L_1$)   120 Hz ($L_2$)
STFT        0.9179           0.9328
Welch       0.9179           0.9328
BT          0.9179           0.9328
Daniell     0.9176           0.9311
Capon       0.9351           0.9458
F-IAA       0.9182           -
ESPRIT      0.9318           0.9444
MUSIC       0.9318           0.9444


Table 2.11 lists the computation time of various methods for ENF extraction
applied to Data 1 with frame length $L_1$. The Capon method is the most time-consuming
with 188 sec for the first harmonic, while its fast implementation resorting to the
Gohberg-Semencul factorization of the inverse covariance matrix requires only
25 sec. STFT requires the least time. Increasing the frame length to $L_2$, the order
of the most time-consuming algorithms remains the same as for $L_1$.

Table 2.11: Computation time (in sec) of various ENF estimation methods applied
to Data 1 with frame length $L_1$.

Algorithm   60 Hz      120 Hz    180 Hz
STFT        0.8836     0.9247    0.6713
Welch       7.8892     7.6732    7.4316
BT          2.7167     2.59778   2.9981
Daniell     1.1282     1.9216    1.7166
Capon       188.2667   97.8226   97.2201
F-Capon     25.8116    6.5003    6.4371
ESPRIT      51.1890    51.7725   51.5155
MUSIC       51.0019    51.4458   51.4412


To assess the performance limits of the ENF estimation methods, the various
ENF estimation methods are applied to the 3rd dataset utterance, which comprises
the US recording, the EU recording, and the mixed recording. The ENF in the
US recording is expected to be found at 60 Hz, while the European one at 50 Hz.
The ENF embedded in the mixed recording is expected to be found at 50 Hz
apart from a 4 sec part, where the ENF is expected to exhibit an abnormal peak far
from 50 Hz due to the alteration applied to the EU utterance. ENF fluctuations
$\Delta f > 150$ mHz are considered to be abnormal. The Daniell method employing
a 9.5 sec long frame is found to be able to estimate correctly the ENF in
the authentic US and EU recordings and to detect the alteration that occurred in the
mixed recording, as can be seen in Fig. 2.6. The vertical lines indicate the start
and end time of the alteration.

Figure 2.6: (a) ENF estimation in US recording, (b) ENF estimation in EU recording,
and (c) ENF estimation in authentic and altered recordings.


Chapter 3


Efficient Capon-based approach 
exploiting temporal windowing 
for Electric Network Frequency 
estimation 


3.1 Introduction 


As stated in Chapter 2, numerous studies on ENF primarily center on the
extraction of the ENF signal to authenticate content, determine the time of
recordings, and establish the recording’s location. However, the majority of these studies
concentrate solely on the development of novel techniques, neglecting both the
preprocessing phase and the selection and parametrization of the spectral window.
Furthermore, a multitude of sophisticated approaches exhibit considerable
computational overhead [30], impeding the estimation of ENF in real-world scenarios
that require real-time operation.

In this Chapter, motivated by [70] [76] [77] [78], we introduce a Capon-based
approach for ENF estimation. The proposed scheme employs a temporal window
and exploits the Toeplitz structure of the covariance matrix. Together with
Krylov matrices, the Gohberg-Semencul (GS) factorization yields a fast and effective
approach for ENF estimation. A Parzen window is employed as a temporal window,
which is shown to yield accurate ENF estimation. Having seen the boost that
ENF estimation gains due to the proper temporal window selection, we examine
various windows of different lengths. In parallel, we explore the effect of window
selection on the ENF extraction using STFT.


3.2 Window selection and estimation procedure


Window selection has not been thoroughly investigated in the ENF estimation
literature. The rectangular window has been used exclusively as a temporal window
[26]. Temporal windowing denotes the multiplication of the time-series with a
window prior to spectral analysis. On the contrary, a lag window denotes the
multiplication of the sequence of autocovariances for various lags with a window [64].
In this Section, we employ temporal windowing. It is shown through extensive
experiments that the selection of window function pays off. Proper window selection
is able to provide finer spectral resolution and boost the accuracy of frequency
estimation. The $N$-point Parzen window [61] [62] is defined for $t \in \mathbb{Z}$ as:

$$w(t) = \begin{cases} 1 - 6\left(\frac{|t|}{N/2}\right)^2 + 6\left(\frac{|t|}{N/2}\right)^3 & 0 \le |t| \le (N-1)/4 \\ 2\left(1 - \frac{|t|}{N/2}\right)^3 & (N-1)/4 < |t| \le (N-1)/2 \end{cases} \tag{3.1}$$

Along with the Parzen window, we also employ the Kaiser, Hamming, and
rectangular windows [79].
The windowed signal reads as

$$\tilde{y}(t) = y(t') \, w(t), \quad 1 \le t \le N \tag{3.2}$$

where $t = t' - kF_s + 1$, $t' = kF_s, \ldots, kF_s + N - 1$. Here, $m = 10$ and $N = LF_s$,
where $L$ is the frame length in sec.
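
The window definition and the frame-wise multiplication can be sketched as follows. This is only an illustration under stated assumptions: the window is sampled on a grid centred at zero and then applied to the $N$ samples of a frame, frames are taken with a hop of $F_s$ samples (1 sec) using zero-based indexing, and the function names `parzen_window` and `window_frame` are hypothetical.

```python
def parzen_window(N):
    """N-point Parzen window sampled at t = -(N-1)/2, ..., (N-1)/2."""
    half = N / 2.0
    w = []
    for n in range(N):
        t = abs(n - (N - 1) / 2.0)        # distance from the window centre
        r = t / half
        if t <= (N - 1) / 4.0:
            w.append(1.0 - 6.0 * r ** 2 + 6.0 * r ** 3)
        else:
            w.append(2.0 * (1.0 - r) ** 3)
    return w

def window_frame(y, k, Fs, N, w):
    """Frame k starts at sample k*Fs (1 s hop) and contains N samples."""
    start = k * Fs
    return [y[start + t] * w[t] for t in range(N)]

# Toy usage (assumed values): a ramp signal, Fs = 10 samples/s, N = 9.
y = [float(i) for i in range(100)]
w = parzen_window(9)
frame = window_frame(y, k=2, Fs=10, N=9, w=w)
```

The window equals one at its centre and decays smoothly towards the frame edges, which is what suppresses spectral leakage compared to the rectangular window.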

In order to estimate the ENF embedded in audio/power recordings, we followed
the general scheme proposed in [39] and the parametrization suggested in
[31]. The first step includes the proper filtering of the raw signal. A sharp
zero-phase band-pass FIR filter with $C_1 = 1001$ and $C_2 = 4801$ coefficients was applied
around the 3rd harmonic of the signal recorded from power mains and around the
2nd harmonic of the speech signal, respectively. A more detailed description of the
dataset is presented in Sec. In both cases, a tight band-pass frequency range
of 0.1 Hz is employed. The frequency is estimated per frame. Consecutive frames
are shifted by 1 second, which translates to 441 samples for a sampling
frequency of 441 Hz. Each frame is then multiplied by a temporal window. Afterwards,
the maximum periodogram value of each frame, which corresponds to an
approximate ENF estimate $\omega_{q_{\max}}$, is derived. In order to obtain a more precise
estimation of ENF, we employ a quadratic interpolation, as indicated in Sec. 2.5.3.
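
The last two steps of the per-frame procedure, peak picking on a zero-padded periodogram followed by quadratic interpolation, can be sketched as below. The interpolation shown is the standard three-point parabolic fit and is an assumption, since the exact variant of Sec. 2.5.3 is not reproduced here. The test tone at 120.03 Hz, the rectangular window, and the function names are likewise illustrative, and the periodogram is evaluated by a direct DFT to avoid external dependencies.

```python
import cmath
import math

def half_periodogram(frame, Q):
    """Periodogram at w_q = 2*pi*q/Q for q = 0, ..., Q/2 - 1 (direct DFT)."""
    N = len(frame)
    phi = []
    for q in range(Q // 2):
        w = 2.0 * math.pi * q / Q
        s = sum(frame[t] * cmath.exp(-1j * w * t) for t in range(N))
        phi.append(abs(s) ** 2 / N)
    return phi

def refine_peak(phi, q):
    """Vertex of the parabola through bins q-1, q, q+1 (fractional bin index)."""
    a, b, c = phi[q - 1], phi[q], phi[q + 1]
    denom = a - 2.0 * b + c
    return q if denom == 0.0 else q + 0.5 * (a - c) / denom

def estimate_frame_frequency(frame, Fs, Q):
    phi = half_periodogram(frame, Q)
    q_max = max(range(1, len(phi) - 1), key=lambda q: phi[q])
    return refine_peak(phi, q_max) * Fs / Q

# Toy frame (assumed values): 1 s of a 120.03 Hz tone sampled at 441 Hz.
Fs = 441
frame = [math.cos(2.0 * math.pi * 120.03 * t / Fs) for t in range(Fs)]
f_hat = estimate_frame_frequency(frame, Fs, Q=4 * Fs)
```

With $Q = 4N$ the frequency grid has a spacing of 0.25 Hz in this toy setting, so the interpolation step is what brings the estimate to within a few mHz of the true tone.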
3.3 Proposed approach


3.3.1 The Capon method 


The periodogram can be interpreted as a filter bank approach, which uses a
band-pass filter whose impulse response vector is given by the standard Fourier
transform vector $[1, e^{-i\omega}, \ldots, e^{-i(N-1)\omega}]^T$. The Capon method is another filter bank
approach based on a data-dependent filter [64]:

$$h = \frac{\hat{R}^{-1} a(\omega)}{a^*(\omega) \hat{R}^{-1} a(\omega)} \tag{3.3}$$

where $a(\omega) = [1, e^{-i\omega}, \ldots, e^{-im\omega}]^T$ and $[\cdot]^*$ denotes conjugate transposition. In
(3.3), $\hat{R}$ is an estimate of the auto-covariance matrix

$$\hat{R} = \frac{1}{N-m} \sum_{t=m+1}^{N} \tilde{\mathbf{y}}(t) \, \tilde{\mathbf{y}}^*(t), \quad \tilde{\mathbf{y}}(t) = [\tilde{y}(t), \ldots, \tilde{y}(t-m)]^T \tag{3.4}$$

where $\tilde{y}(t) = w(t-n) y(t)$. The Capon spectral estimate is given by:

$$\hat{\phi}(\omega) = \frac{m+1}{a^*(\omega) \hat{R}^{-1} a(\omega)} \tag{3.5}$$

Eq. (3.5) is computed for dense frequency samples $\omega_q = \frac{2\pi}{Q} q$, $q = 0, 1, \ldots, Q-1$
with $Q = 4N$ every sec. ENF is estimated by the angular frequency sample
$\omega_{q_{\max}}$ where the Power Spectral Density (PSD) of Eq. (3.5) attains a maximum for
$q \in [0, \frac{Q}{2} - 1]$ and $f_q = \frac{\omega_{q_{\max}}}{2\pi} F_s$. The Capon method has been found to be able to
resolve fine details of PSD, making it a superior alternative to periodogram-based
methods for ENF estimation.


3.3.2 Fast implementation 


Let $N = m+1$ and let $\mathbf{y} = [y(t), y(t-1), \ldots, y(t-m)]^T \triangleq \mathbf{y}_N$ denote the data
vector. The proposed approach exploits the Toeplitz structure of the autocovariance
matrix $R_N \triangleq \hat{R}$ in order to reduce the computational complexity of the
inversions included in the process of Capon spectral estimation. It is not limited to
Toeplitz structures, but extends to low displacement rank matrices, where the
displacement representation of $R_N$ is defined as [69]:

$$\nabla_{D_N, D_N^T} R_N = R_N - D_N R_N D_N^T \tag{3.6}$$

with $D_N$ being a lower triangular matrix, such as the lag-1 shift matrix:

$$D_N = \begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ & \ddots & \ddots & \\ & & 1 & 0 \end{bmatrix}$$

The core of this approach lies in the fast and efficient inversion of $R_N$ using
the GS factorization. To do so, we exploit the Krylov matrix. Given an arbitrary
vector $v_N$ and a lower triangular matrix $D_N$, the Krylov matrix is defined as
follows:

$$\mathcal{K}_N(v_N, D_N) = [v_N, D_N v_N, \ldots, D_N^{N-1} v_N] \tag{3.7}$$

Exploiting the Krylov matrix and the unit lower shift matrix $D_N$, the inverse
of $R_N$ is formulated as [69] [64]:

$$R_N^{-1} = \mathcal{K}_N(\gamma_N, D_N) \mathcal{K}_N^*(\gamma_N, D_N) - \mathcal{K}_N(\delta_N, D_N) \mathcal{K}_N^*(\delta_N, D_N) \tag{3.8}$$

with

$$\gamma_N = \frac{1}{\sqrt{\alpha_{N-1}}} \begin{bmatrix} 1 \\ w_{N-1} \end{bmatrix} \tag{3.9}$$

$$\delta_N = \frac{1}{\sqrt{\alpha_{N-1}}} \begin{bmatrix} 0 \\ \tilde{w}_{N-1} \end{bmatrix} \tag{3.10}$$

where the vector $\tilde{w}_{N-1}$ is the conjugate transpose version of $w_{N-1}$ with the order
of its elements reversed.
The covariance matrix $R_N$ is partitioned as:

$$R_N = \begin{bmatrix} \rho_0 & \rho_{N-1}^* \\ \rho_{N-1} & R_{N-1} \end{bmatrix} \tag{3.11}$$

where $\rho_0$ is the element (1,1) of $R_N$, $\rho_{N-1}$ is a column vector with entries the rest
of the elements of the first column of $R_N$, while $\rho_{N-1}^*$ represents the conjugate
transpose row vector of $\rho_{N-1}$.
The parameters $\gamma_N$ and $\delta_N$ are computed by solving the following system for
$w_{N-1}$ [70] [80]:

$$R_{N-1} w_{N-1} = -\rho_{N-1} \tag{3.12}$$

and $\alpha_{N-1} = \rho_0 + \rho_{N-1}^* w_{N-1}$.

In order to efficiently calculate the denominator of Eq. (3.5), we can define the
polynomial related to $R_N$ as

$$\varphi(z) \triangleq a^*(z) R_N^{-1} a(z) = \sum_{k=-m}^{m} c_k z^{k} \tag{3.13}$$

Then the polynomial in the denominator of (3.5) is rewritten as

$$\varphi(\omega_q) = \sum_{k=-m}^{m} c_k e^{i\omega_q k} \tag{3.14}$$

and can be computed via applying zero padded FFT to the coefficients $\{c_k\}_{k=-m}^{m}$.


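Let $c \triangleq [c_0, c_1, \ldots, c_m]^T$. Then, it follows from [76] that:

$$c = \mathcal{K}_N^*(\gamma_N, D_N) \, \gamma'_N - \mathcal{K}_N^*(\delta_N, D_N) \, \delta'_N \tag{3.15}$$

where $\gamma'_N$ and $\delta'_N$ denote the vectors $\gamma_N$ and $\delta_N$ with their $j$-th entry,
$j = 0, 1, \ldots, m$, weighted by $N - j$.
Eq. (3.15) constitutes a summation of Toeplitz matrix-vector products and
can be computed by applying FFT. Since $R_N^{-1}$ is Hermitian and $\{\varphi(\omega_q)\}_{q=0}^{Q-1}$ are
real-valued, from (3.14) we have

$$c_{-l} = c_l^*, \quad l = 0, 1, \ldots, m, \tag{3.16}$$

and this concludes the fast computation of the $\{c_k\}_{k=-m}^{m}$ in (3.14).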
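
The chain from the Toeplitz system (3.12) to the denominator of the Capon estimate can be checked numerically with the following sketch. It is not the implementation used in the experiments: for the sake of a dependency-free illustration, the system (3.12) is solved by plain Gaussian elimination rather than by a fast Toeplitz solver, and the polynomial (3.14) is evaluated by direct summation rather than by a zero-padded FFT. The coefficients $c_k$ are obtained as the sums of the $k$-th sub-diagonals of $R_N^{-1}$, written in terms of $\gamma_N$ and $\delta_N$. The toy covariance sequence `rho` and all function names are assumptions.

```python
import cmath

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0j] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def gs_vectors(rho):
    """gamma_N and delta_N from the first column rho of a Hermitian Toeplitz R_N."""
    N = len(rho)
    R_sub = [[rho[j - l] if j >= l else rho[l - j].conjugate()
              for l in range(N - 1)] for j in range(N - 1)]
    w = solve(R_sub, [-r for r in rho[1:]])                  # Eq. (3.12)
    alpha = (rho[0] + sum(r.conjugate() * x for r, x in zip(rho[1:], w))).real
    s = alpha ** 0.5
    gamma = [complex(1.0 / s)] + [x / s for x in w]          # Eq. (3.9)
    delta = [0j] + [x.conjugate() / s for x in reversed(w)]  # Eq. (3.10)
    return gamma, delta

def capon_coefficients(gamma, delta):
    """c_k, k = 0..m: sums of the k-th sub-diagonals of R_N^{-1} in Eq. (3.8)."""
    N = len(gamma)
    return [sum((N - k - s) * (gamma[s + k] * gamma[s].conjugate()
                               - delta[s + k] * delta[s].conjugate())
                for s in range(N - k)) for k in range(N)]

def capon_denominator(c, w):
    """a*(w) R^{-1} a(w) = c_0 + 2 Re sum_{k>=1} c_k exp(i w k), using c_{-k} = c_k*."""
    return c[0].real + 2.0 * sum(ck * cmath.exp(1j * w * k)
                                 for k, ck in enumerate(c) if k > 0).real

# Toy covariance sequence (assumed): two complex sinusoids plus white noise.
m = 5
rho = [1.6 + 0j] + [cmath.exp(-0.9j * k) + 0.5 * cmath.exp(0.3j * k)
                    for k in range(1, m + 1)]
gamma, delta = gs_vectors(rho)
c = capon_coefficients(gamma, delta)
Q = 256
phi = [(m + 1) / capon_denominator(c, 2.0 * cmath.pi * q / Q) for q in range(Q)]
w_hat = 2.0 * cmath.pi * max(range(Q), key=lambda q: phi[q]) / Q
```

Only the $m + 1$ coefficients $c_0, \ldots, c_m$ need to be computed per frame; in a fast implementation the evaluation over all $Q$ frequencies would then be delegated to a single zero-padded FFT.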


3.4 Experimental evaluation and statistical tests 


3.4.1 Results 


To assess the effectiveness of the proposed approach, we conducted experiments
using the two datasets described in Sec. The proposed approach was compared
against state-of-the-art methods in ENF estimation. For each dataset, a
30-minute recording was utilized, a duration chosen to ensure sufficient data for
a comprehensive comparison with existing approaches. Regarding
Data 1, four different frame lengths of 1, 5, 10, and 20 sec were used for ENF
estimation. As stated in Table 3.1, by applying the procedure in Sec. 3.3, a correlation
coefficient of 0.9990 is obtained when a frame length of 20 seconds is employed.
This value exceeds the state-of-the-art linear prediction estimation presented in
[30], which reaches 0.9984, even though the latter value does not stem purely from
the linear prediction method, because a denoising procedure is applied beforehand.
Moreover, the aforementioned value of 0.9990 exceeds those of the Maximum
Likelihood (ML) method and the Welch one, which was properly parametrized in
[31]. An interesting fact regarding our proposed method lies in the frame length.
When shorter frame lengths of 5 and 10 sec are employed, the accuracy gets higher
and reaches 0.9991, while for the ML, linear prediction, and weighted spectrogram
methods accuracy drops as the frame length gets shorter. Fast and accurate ENF
estimation is needed in real-world applications, and it is seen that even when a
1 second frame length is adopted, an accuracy of 0.9990 is obtained. Despite the
very short frame length, there is a high resolution enabling the accurate estimation
of the ENF. For all other methods, accuracy drops below 0.99 within this setup and
reaches 0.8255 when the weighted spectrogram is used.

In addition to the comparisons between the different approaches for ENF
estimation, a systematic study was carried out in order to examine the impact the
different windows have on ENF estimation. Four different windows along with four
different frame lengths were employed, as shown in Table 3.2. As stated before, the
Parzen window yields the highest accuracy among approaches and is practically
unaffected by the frame length. This makes it the best choice in ENF estimation
applied to a recording from power mains. The performance of the Hamming window
is similar, which makes it a good alternative to the Parzen window. On the contrary,
Kaiser and rectangular windows yield remarkable results only when a larger frame
length is employed. Even when a 10-second frame is used, Kaiser and rectangular
windows are not able to provide accurate ENF estimation. Therefore, the choice
of the window is not a trivial task and impacts accurate ENF estimation.

Employing the proper window is essential in every approach chosen for ENF
estimation. In order to demonstrate the crucial role of window selection, we present
the accuracy obtained using the trivial method of the STFT for various windows.
As shown in Fig. 3.1, the Parzen window yields highly accurate results even when a
1 second frame length is employed. This accuracy approaches 0.9990, which means
that the STFT approach with a proper temporal window can outperform
state-of-the-art methods. It is also evident that even though high accuracy is achieved
when long frame lengths are used, the window selection is very important when shorter
ones are employed. Additionally, it is demonstrated that longer frame lengths do
not necessarily imply better accuracy in terms of correlation coefficient.


Table 3.1: Correlation coefficient for various frame lengths - Data 1.

Frame length (in sec)          1        5        10       20
Proposed with Parzen window    0.9990   0.9991   0.9991   0.9990
ML                             0.8826   0.9852   0.9953   0.9977
Linear Prediction [30]         0.9651   0.9959   0.9976   0.9984
Welch [31]                     0.9847   0.9989   0.9989   0.9983
Weighted Spectrogram           0.8255   0.9873   0.9944   0.9966


Figure 3.1: STFT using different windows for ENF estimation in Data 1.


Table 3.2: Correlation coefficient for various windows - Data 1.

Frame length (in sec)   1        5        10       20
Parzen                  0.9990   0.9991   0.9991   0.9990
Hamming                 0.9989   0.9991   0.9990   0.9988
Kaiser                  0.0086   0.0495   0.0438   0.9976
Rectangular             0.0047   0.0798   0.0689   0.9975


The second dataset (Data 2) comprises speech recordings in which interference
exists and which suffer from low SNR. In Data 2, we studied the second harmonic,
where a higher SNR permits us to obtain reliable results. Our proposed approach
(Sec. 3.3) is extremely fast due to the fact that it exploits Krylov matrices and
the Toeplitz structure of the covariance matrix. It also provides high frequency
resolution, yielding an accuracy of 0.9351 in terms of correlation coefficient. This
result is obtained using a 33 sec frame length and a rectangular window. Our
approach outperforms the existing ML approach and the high resolution MUSIC
method, as demonstrated in Table 3.3. It lags behind the pure linear prediction
method without the additional denoising procedure by about 0.0015 w.r.t. the
correlation coefficient, but it is still an efficient alternative, taking
into account the fact that linear prediction is an iterative approach, which includes
large matrix inversions in each iteration. For large datasets, like the
one discussed in this study, linear prediction is much slower than the proposed
approach. Consequently, the trade-off between accuracy and time complexity makes
our approach a useful tool in ENF estimation, regardless of the nature and the
duration of the recordings.

A systematic study was also carried out in order to examine the impact of
the window on the speech recordings of Data 2. Four windows were employed, as
done for Data 1, with four different frame lengths, as presented in Table 3.4. In
speech recordings, increasing the frame length provides better results at the expense
of time requirements. It is also evident that for very short frame lengths (i.e., 5
sec) accuracy deteriorates rapidly. Moreover, the selected window plays a key role
in the final accuracy of ENF estimation. Although all
four choices provide acceptable results in terms of accuracy, it is the rectangular
window that outperforms its competitors in this case.

In order to determine whether the correlation coefficient of the proposed method
is significantly different from that of other methods ($H_1: c_1 \neq c_2$), hypothesis
testing was applied. The Fisher transformation, $z = 0.5 \ln \frac{1+c}{1-c}$, was employed for each pair
of correlation coefficients under examination [66]. At the 5% significance level, the
test statistic $q = \sqrt{n-3} \, (z_1 - z_2)$ was outside the region $-1.96 < q < 1.96$, for
$n = 1800$. Thus, the null hypothesis was rejected for every pair of comparisons.
Accordingly, the differences between the correlation coefficients were significant at
a confidence level of 95%.
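
The test can be reproduced with a few lines of Python. The sketch below follows the statistic as it is written in the text; the function names are hypothetical, and the pair of coefficients (0.9990 for the proposed approach versus 0.9984 for linear prediction, both for the 20 sec frame length of Table 3.1) is chosen only as an example.

```python
import math

def fisher_z(c):
    """Fisher transformation z = 0.5 * ln((1 + c) / (1 - c))."""
    return 0.5 * math.log((1.0 + c) / (1.0 - c))

def correlation_difference_statistic(c1, c2, n):
    """q = sqrt(n - 3) * (z1 - z2), following the expression given in the text."""
    return math.sqrt(n - 3) * (fisher_z(c1) - fisher_z(c2))

# Example pair taken from Table 3.1 (20 sec frames), n = 1800 samples.
q = correlation_difference_statistic(0.9990, 0.9984, 1800)
significant = abs(q) > 1.96     # two-sided test at the 5% level
```

Because the Fisher transformation stretches values close to one, even differences in the fourth decimal of the correlation coefficient lead to a large statistic for $n = 1800$.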


Table 3.3: Correlation coefficient for various frame lengths - Data 2.

Frame length (in sec)               10       33
Proposed with rectangular window    0.8663   0.9351
ML                                  0.9059   0.9319
Linear Prediction [30]              0.9213   0.9366
MUSIC [31]                          0.9087   0.9318
Weighted Spectrogram                0.8787   0.9125


Table 3.4: Correlation coefficient for various windows - Data 2.

Frame length (in sec)   5        10       20       33
Parzen                  0.7063   0.7773   0.8413   0.8785
Hamming                 0.7453   0.8128   0.8703   0.8987
Kaiser                  0.8081   0.8663   0.9035   0.9228
Rectangular             0.8092   0.8663   0.9036   0.9351


Chapter 4


An automated approach for 
Electric Network Frequency 
estimation in static and 
non-static digital video recordings 


4.1 Introduction 


The vast amount of information contained in multimedia content, i.e., audio,
image, and video recordings, has prompted perpetrators to commit forgery attacks
distorting the digital content. Digital forensics has experienced
exponential growth in the last decades, as digital manipulation methods are
constantly evolving and affecting various aspects of social and economic life. To
this end, emphasis has been put on advancing emerging technologies in the field of
digital forensics, which can efficiently verify the authenticity of multimedia content
and cope with multimedia forgeries. A comprehensive survey of image and video
forensics techniques can be found in and [82], respectively.

This Chapter delves into a comprehensive analysis of the utilization of ENF
for indoor video recording, particularly in environments where fluorescent light is
present. The primary objective is to investigate the role of ENF in multimedia
authentication both in static and non-static video recordings. By scrutinizing the
impact of fluorescent light on ENF-based techniques, we aim to elucidate the
potential benefits and challenges associated with employing ENF for accurate and
reliable multimedia authentication in indoor settings. The ENF can be captured
in video recorded in indoor environments due to fluorescent illumination.
Illumination intensity variations resemble ENF variations in the power grid [83]. Thus,
ENF estimation can be exploited for multimedia authentication, time-stamp
verification, and forgery detection in audio and video recordings. Until recently, the
research has mainly been focused on audio recordings, where many advances have
been achieved.


4.2 Related work 


Although a lot of attention has been paid to ENF estimation in audio recordings,
it was found that the ENF can also be traced in video recordings. The ENF
can be estimated in videos captured under the illumination of fluorescent bulbs in
indoor environments. ENF variations caused by power grid networks affect the
illumination intensity and each frame captures a time-snapshot of ENF. ENF video
estimation approaches can be divided into two categories based on the recording
sensor type. The first category consists of videos captured by charge-coupled
device (CCD) sensors, which employ a global shutter mechanism. This type of
sensor captures all pixels of a frame at the same instant. Thus, each frame depicts
a specific time snapshot. When CCD sensors are used, the state-of-the-art approach for
ENF estimation is based on averaging all pixels in each frame of static videos
[83]. For non-static videos, state-of-the-art ENF estimation suggests averaging
all steady pixels in each video frame. The second category consists of videos
captured by complementary metal oxide semiconductor (CMOS) sensors. Such
sensors employ a rolling shutter mechanism, which acquires a row at a time in each
video frame [83] [84]. A comprehensive analysis of the rolling shutter effect was
conducted in [85]. An analytical model for videos captured using a rolling shutter
mechanism was developed, demonstrating the relation between ENF variations
and the idle period length. ENF-based video forensics is not trivial, especially
for non-static video recordings. ENF presence detection based on superpixels (i.e.,
multiple pixels) was proposed in [86]. The proposed approach could be applied to
static and non-static videos captured by both CCD and CMOS camera sensors.
Recently, a method for ENF estimation in non-static videos was presented in [87].
This method could be accurately utilized in video recordings whose frame rate
is unknown. The ENF was applied to video recordings for camera identification
in [88]. Video synchronization can be efficiently achieved by employing the ENF.
Video synchronization methods were developed in that were based on ENF
signal alignment. A forgery detection algorithm based on the ENF signal was proposed
in without needing any ground truth signal. A technique to detect false frame
injection attacks in video recordings using the ENF was discussed in [90]. ENF was
employed to authenticate video feeds from surveillance cameras. ENF estimation
and detection in single images captured by CMOS camera sensors constitutes a
challenging task. Novel investigations taking into consideration the ENF strength
were described in [91]. ENF estimation in videos with a rolling shutter mechanism
was presented in [92]. Both parametric and non-parametric spectral estimation
methods were combined for accurate ENF estimation.

In this Chapter, inspired by [86], an automated approach is proposed for ENF
estimation from CCD video recordings based on Simple Linear Iterative Clustering
(SLIC) [93]. Areas of common characteristics that include superpixels are
generated using the SLIC algorithm. The proposed approach takes into consideration
only the superpixels whose average intensity exceeds a predefined threshold. It
is shown that within these areas, the embedded ENF is not hindered by any
interference, resulting in more accurate estimation regardless of whether the video
recording is static or not. The novelty of the proposed approach lies in 1) the
creation of areas with similar characteristics and 2) the estimation of ENF exploiting
only these areas in contrast to what has been done for ENF estimation in videos
so far.
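
The selection of bright superpixels can be pictured with the following sketch. It makes several assumptions that are not taken from the text: the superpixel label map is taken as given (for instance, precomputed by a SLIC implementation), frames are plain 2-D lists of grayscale intensities, the threshold is applied to the per-frame mean intensity of each superpixel, and the function name `bright_superpixel_mean` is hypothetical. The returned value would be one sample of the intensity time series from which the ENF is subsequently estimated.

```python
def bright_superpixel_mean(frame, labels, threshold):
    """Mean intensity over all pixels that belong to superpixels whose
    average intensity exceeds the threshold; None if no superpixel qualifies."""
    sums, counts = {}, {}
    for row_px, row_lb in zip(frame, labels):
        for value, label in zip(row_px, row_lb):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
    kept = [lb for lb in sums if sums[lb] / counts[lb] > threshold]
    if not kept:
        return None
    total = sum(sums[lb] for lb in kept)
    n_pixels = sum(counts[lb] for lb in kept)
    return total / n_pixels

# Toy 2 x 4 frame (assumed values) with two superpixels: bright (0) and dark (1).
frame = [[200, 210, 10, 20],
         [220, 230, 30, 40]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
sample = bright_superpixel_mean(frame, labels, threshold=100)
```

Applying the function to every frame of a video yields one intensity value per frame, so the resulting series is sampled at the camera frame rate.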

The motivation for the development of the proposed approach is to mitigate 
the interference and noise caused by textures, shadows, and brightness that are 
present in real-life applications, such as surveillance videos. By doing so, we ad- 
vance the related literature, where static videos are mostly used, such as the ” white 
wall” recordings. From a practical point of view, the proposed approach enables 
automated ENF estimation regardless of whether the video recording is static 
or non-static. Thus, it can be applied to practical forensics applications, such as 
multimedia content authentication, indicating the place where a recording was cap- 
tured, and revealing the time the recording was made. It is worth noting that the 
proposed approach is tested on real-world static and non-static videos of escalating 
difficulty in order to simulate real conditions. The MCC between the estimated 
ENF and the reference signal is employed to measure ENF estimation accuracy. 
Moreover, hypothesis testing is performed to assess the statistical significance of 
the improvements delivered by the proposed approach. 


4.2.1 ENF estimation 


It has been shown, recently, that ENF traces can be embedded in video recordings 
due to light intensity variations. Such recordings are captured in the presence of 
fluorescent light or the light emitted by incandescent bulbs [89]. The light intensity 
is directly connected to electric current and its nominal frequency is influenced by 
the ENF signal, fluctuating at twice the nominal frequency of ENF, i.e., 100 Hz 
in Europe, and 120 Hz in the U.S. The lower temporal sampling rate of cameras 
capturing video recordings compared to frequency components in light flickering 
results in significant aliasing of ENF signals. Thus, ENF is present at different 
frequencies than those appearing in audio recordings. These frequencies can be 
derived by applying the sampling theorem [79]. Besides the fundamental frequency 
of power mains, it is the frame rate of the video camera that influences the aliased 
base frequency of ENF in video recordings [83]. 

Table 4.1: Aliased frequencies of ENF w.r.t. camera frame rate and fundamental 
ENF at power mains frequency [83]. 

Power mains (Hz)   Camera frame rate $f_s$ (fps)   Aliased base frequency (Hz)
50                 29.97                           10.09
50                 30                              10
60                 29.97                           0.12
60                 30                              0

The aliased frequency $f_a$ emanating from fluorescent illumination is given as follows [94]: 

$$f_a = |f_l - \gamma f_s| < \frac{f_s}{2} \qquad (4.1)$$

where $f_s$ denotes the sampling frequency of the camera, $f_l$ denotes the frequency of 
light source illumination, and $\gamma$ is an integer. Aliased frequencies of ENF based on 
different camera frame rates and power mains frequencies are listed in Table 4.1. 
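As a quick numerical check of (4.1), the following Python sketch computes the aliased base frequency for the frame rates of Table 4.1. The function name `aliased_frequency` and the choice of the integer multiple as the one nearest to the ratio of illumination frequency to frame rate are illustrative assumptions. The illumination frequency is taken as twice the mains frequency, as stated above.

```python
def aliased_frequency(f_light, f_s):
    """Aliased frequency |f_l - gamma * f_s| of a light source flickering at
    f_light Hz when sampled at f_s frames per second."""
    # Assumption: gamma is the multiple of f_s nearest to f_light, which folds
    # the result into the interval [0, f_s / 2].
    gamma = round(f_light / f_s)
    return abs(f_light - gamma * f_s)


if __name__ == "__main__":
    for mains_hz, fps in [(50, 29.97), (50, 30.0), (60, 29.97), (60, 30.0)]:
        # The light intensity fluctuates at twice the mains frequency.
        f_a = aliased_frequency(2 * mains_hz, fps)
        print(f"mains {mains_hz} Hz, {fps} fps: aliased base frequency {f_a:.2f} Hz")
```

The nearest-multiple rule is only one way of selecting the integer; any integer that brings the result below half the frame rate yields the same aliased frequency.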
The ENF estimation procedure in video recordings differs slightly from that 
employed in audio ones. The difference is in the pre-processing stage. Two cases 
are examined depending on whether the video recordings are static or non-static. 
Regarding static videos, the state-of-the-art suggests computing the mean 
intensity of each frame, transforming the two-dimensional (2D) images into a 1D 
time-series. It is worth noting that the majority of experiments conducted so far 
employ static recordings of white wall videos. Here, we employ a variety of static 
recordings different from white wall videos, as detailed in Sec. 4.4.1. Regarding 
non-static videos, the current practice is to compute the mean intensity of relatively 
stationary areas of each frame. In both categories, a 1D time-series is 
formed and the estimation procedure follows that employed for audio recordings. 
This time-series is treated as a raw signal that is passed through a zero-phase 
bandpass filter around the frequencies where ENF appears. Specifically, the band- 
pass edges of the filter are set at 9.9 Hz and 10.1 Hz when the nominal frame rate 
is 30 Hz despite the fact that the nominal frame rate was claimed to be 29.97 Hz 
in [87]. The bandpass edges employed herein accommodate also the aliased base 
frequency, which corresponds to a nominal frame rate of 29.97 Hz. The filtering 
procedure is of crucial importance in ENF estimation [31]. Subsequently, the signal 
is split into V overlapping segments of L samples each. Each segment is shifted 
by S s from its immediate predecessor and is multiplied by an L-size rectangular 
window. Any temporal window can be employed in the pre-processing procedure. 
Afterwards, the prevalent frequency of each segment is estimated by spectral estimation. 
Frequently, a quadratic interpolation is used to overcome the interference 
that hinders the entire procedure, resulting in more precise ENF estimation [26]. 
Here, the estimated ENF signal f is calculated by employing shifts of 1 s (i.e., 
S = 1). 
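The segmentation step described above can be sketched in Python as follows. This is a minimal illustration, assuming NumPy is available; the function name `split_into_segments`, its parameters, and the 20 s segment duration in the usage lines are hypothetical. The bandpass filtering that precedes segmentation is not shown.

```python
import numpy as np


def split_into_segments(signal, fs, duration_s, shift_s=1.0):
    """Split a 1D signal into overlapping segments (rectangular window).

    Each segment holds L = duration_s * fs samples and starts shift_s seconds
    after its immediate predecessor.
    """
    signal = np.asarray(signal, dtype=float)
    seg_len = int(round(duration_s * fs))  # L samples per segment
    hop = int(round(shift_s * fs))         # shift between segments, in samples
    if seg_len <= 0 or hop <= 0 or seg_len > len(signal):
        raise ValueError("segment length and hop must be positive and fit the signal")
    starts = range(0, len(signal) - seg_len + 1, hop)
    # A rectangular window leaves the samples unchanged; any other window could
    # be applied here by multiplying each row with it.
    return np.stack([signal[s:s + seg_len] for s in starts])


if __name__ == "__main__":
    x = np.arange(12 * 60 * 30)  # 12 min of frames at a nominal 30 fps
    segments = split_into_segments(x, fs=30, duration_s=20, shift_s=1.0)
    print(segments.shape)        # (number of segments V, samples per segment L)
```

With a shift of 1 s, one ENF estimate per second is obtained, so the number of segments equals the number of ENF samples for the chosen segment duration.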


4.3 Proposed method 


In this Chapter, a video ENF estimation approach for static and non-static video 
recordings is presented. It is based on the SLIC algorithm for image segmen- 
tation. The SLIC algorithm generates superpixels, which are regions of similar 
characteristics. The idea behind the proposed approach is that in regions having 
high luminance levels and not hindered by shadows or dark areas, light source 
variations can easily be detected and, thus, the ENF signal can be estimated more 
accurately. The first step of the proposed approach generates N regions with similar 
characteristics in the first frame of a video recording. Afterwards, the mean 
intensity values $\zeta_n(1)$, $n = 1, 2, \ldots, N$, of all regions in the first frame are computed 
and only those exceeding a predefined threshold $\tau$ are retained. Let $\boldsymbol{\zeta}(1)$ be 
the vector with elements $\zeta_n(1)$. If $\tilde{N} = |\{n : \zeta_n(1) > \tau\}|$ denotes the number of region 
mean intensity values exceeding the threshold, then the mean intensity value for 
the first frame is given as follows [95]: 

$$x(1) = \frac{1}{\tilde{N}} \sum_{n=1}^{N} \zeta_n(1)\, u(\zeta_n(1) - \tau) \qquad (4.2)$$

where $u(\zeta_n(1) - \tau)$ denotes the Heaviside function. 
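A minimal Python sketch of the thresholded averaging in (4.2) is given below, assuming NumPy and a label map that assigns one integer region index to every pixel. Such a map could come from any SLIC implementation, for instance scikit-image's `skimage.segmentation.slic`; here it is supplied by hand. The function names `region_means` and `frame_intensity` are hypothetical, and the default threshold follows the median-based choice reported in the experiments of this Chapter.

```python
import numpy as np


def region_means(frame, labels):
    """Mean intensity of every region; `labels` holds one integer id (0..N-1) per pixel."""
    flat_labels = labels.ravel()
    sums = np.bincount(flat_labels, weights=frame.ravel().astype(float))
    counts = np.bincount(flat_labels)
    return sums / np.maximum(counts, 1)  # guard against unused region ids


def frame_intensity(frame, labels, tau=None):
    """Average of the region means that exceed the threshold tau, as in (4.2)."""
    zeta = region_means(frame, labels)
    if tau is None:
        tau = np.median(zeta) / 3.0      # median of the region means divided by 3
    keep = zeta > tau                    # Heaviside step: only regions above tau contribute
    if not keep.any():
        raise ValueError("no region exceeds the threshold")
    return float(zeta[keep].mean())


if __name__ == "__main__":
    frame = np.array([[10, 10, 200, 220],
                      [5, 7, 100, 120]])
    labels = np.array([[0, 0, 1, 1],
                       [2, 2, 3, 3]])
    print(frame_intensity(frame, labels, tau=50))
```

Applying `frame_intensity` to every frame with the label map of the first frame yields the 1D intensity time-series used for ENF estimation.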

In the next step, the generated regions from the first frame are located in 
all A frames of the video recording. For a video recording of 12 min duration 
at 30 frames per second, A = 21,600 frames. Employing these regions, the mean intensity values of 
the regions are computed and, then, the mean intensity value in each frame is 
calculated, as in (4.2). In this way, each video frame is represented by an intensity 
value x(t), t = 1, 2, ..., A. 

A non-parametric method, namely the STFT, and a parametric one, i.e., the Estimation 
of Signal Parameters via Rotational Invariance Techniques (ESPRIT), were employed for ENF 
estimation. Hereafter, the frames, indexed by t, will be referred to as samples. 

The STFT is one of the most common methods in time-frequency analysis of 
signals. Assuming stationarity within the short-time segments of the signal, the 
discrete-time Fourier transform is computed for each time segment [96]: 

$$X_l(\omega) = \sum_{t=-\infty}^{\infty} x(t)\, w(t - lG)\, e^{-j\omega t} \qquad (4.3)$$

where $w(t)$ denotes a window function of length L, $X_l(\omega)$ is the discrete-time 
Fourier transform of the windowed data centered around $lG$, and $G = S f_s$ is 
the hop size in samples. The proper selection of the window function constitutes a 
very important issue in STFT and, generally, in the majority of time-frequency 
analysis methods. This is because an optimal trade-off between time and frequency 
resolution is sought. Let $\phi_l(\omega_k) \propto |X_l(\omega_k)|^2$ be the periodogram of the $L = D f_s$ 
samples long $l$th segment, where $\omega_k$, $k = 0, 1, \ldots, Q-1$, are the frequency samples 
with $Q = 4L$. Specifically, the frequency sample $\omega_k$ that corresponds to the 
maximum periodogram value is extracted as a first ENF estimate. Afterwards, a 
quadratic interpolation is employed to obtain a refined ENF estimate. 
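The peak picking and quadratic interpolation described above can be sketched as follows, assuming NumPy. The function name `peak_frequency`, the `band` argument that restricts the search to the passband, and the parabola fitted through the peak bin and its two neighbours are illustrative assumptions; the text does not specify which interpolation formula was used.

```python
import numpy as np


def peak_frequency(segment, fs, band, zero_pad_factor=4):
    """Estimate the dominant frequency (Hz) of one segment inside `band`.

    The periodogram is evaluated on Q = zero_pad_factor * L frequency samples
    and the peak location is refined by fitting a parabola through the peak
    bin and its two neighbours.
    """
    segment = np.asarray(segment, dtype=float)
    n_freq = zero_pad_factor * len(segment)              # Q = 4L by default
    power = np.abs(np.fft.rfft(segment, n=n_freq)) ** 2  # periodogram up to a scale factor
    freqs = np.fft.rfftfreq(n_freq, d=1.0 / fs)
    in_band = np.flatnonzero((freqs >= band[0]) & (freqs <= band[1]))
    k = int(in_band[np.argmax(power[in_band])])          # coarse estimate: peak bin
    offset = 0.0
    if 0 < k < len(power) - 1:
        left, mid, right = power[k - 1], power[k], power[k + 1]
        denom = left - 2.0 * mid + right
        if denom != 0.0:
            offset = 0.5 * (left - right) / denom        # parabola vertex, in bins
    return (k + offset) * fs / n_freq


if __name__ == "__main__":
    fs = 30.0
    t = np.arange(600) / fs                              # one 20 s segment
    x = np.cos(2 * np.pi * 10.03 * t + 0.3)
    print(peak_frequency(x, fs, band=(9.9, 10.1)))
```

Zero-padding to four times the segment length only refines the frequency grid; the interpolation then reduces the remaining grid error without lengthening the segment.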

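ESPRIT is also employed to estimate the ENF signal. Let $\hat{\mathbf{R}}$ be the sample 
covariance matrix 

$$\hat{\mathbf{R}} = \frac{1}{L} \sum_{t=m}^{L} \tilde{\mathbf{x}}(t)\, \tilde{\mathbf{x}}^{\top}(t) \qquad (4.4)$$

where $^{\top}$ stands for transposition and 

$$\tilde{\mathbf{x}}(t) \triangleq [x(t), x(t-1), \ldots, x(t-m+1)]^{\top}. \qquad (4.5)$$

Let $\hat{\mathbf{S}}$ be the matrix whose columns are the W principal eigenvectors of $\hat{\mathbf{R}}$, which span 
the signal subspace. Let $\hat{\mathbf{S}}_1 = [\mathbf{I}_{m-1} | \mathbf{0}]\, \hat{\mathbf{S}}$ and $\hat{\mathbf{S}}_2 = [\mathbf{0} | \mathbf{I}_{m-1}]\, \hat{\mathbf{S}}$, where $\mathbf{I}_{m-1}$ 
denotes the $(m-1) \times (m-1)$ identity matrix. ESPRIT estimates the angular frequencies 
$\{\hat{\omega}_k\}_{k=1}^{W}$ as $-\arg(\hat{\delta}_k)$, where $\{\hat{\delta}_k\}_{k=1}^{W}$ are the eigenvalues of the estimated 
matrix $\hat{\boldsymbol{\Phi}}$ [64]: 

$$\hat{\boldsymbol{\Phi}} = (\hat{\mathbf{S}}_1^{\top} \hat{\mathbf{S}}_1)^{-1} \hat{\mathbf{S}}_1^{\top} \hat{\mathbf{S}}_2 \qquad (4.6)$$

The frequency $-\frac{\arg(\hat{\delta}_k)}{2\pi} f_s$ (in Hz) which is closest to the aliased base frequency 
is the ENF estimate. Here, m = 10 and W = 3. 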
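The ESPRIT steps above can be sketched in Python with NumPy as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name `esprit_enf` and its arguments are hypothetical, the rotation matrix is obtained through a least-squares solve, and the sign of the eigenvalue angles is dropped because the intensity signal is real-valued.

```python
import numpy as np


def esprit_enf(x, fs, f_nominal, m=10, n_components=3):
    """ESPRIT frequency estimate (Hz) closest to f_nominal for a real 1D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Each row holds one snapshot [x(t), x(t-1), ..., x(t-m+1)].
    snapshots = np.stack([x[t - m + 1:t + 1][::-1] for t in range(m - 1, n)])
    cov = snapshots.T @ snapshots / n                   # sample covariance matrix
    _, eigvecs = np.linalg.eigh(cov)                    # eigenvalues in ascending order
    basis = eigvecs[:, -n_components:]                  # principal eigenvectors
    upper, lower = basis[:-1, :], basis[1:, :]          # [I 0] S and [0 I] S
    # Least-squares solution of upper @ phi = lower, i.e. (S1^T S1)^{-1} S1^T S2.
    phi = np.linalg.lstsq(upper, lower, rcond=None)[0]
    angles = np.angle(np.linalg.eigvals(phi))
    # For a real signal the rotations come in conjugate pairs, so only the
    # magnitude of the angle is kept.
    freqs = np.abs(angles) * fs / (2.0 * np.pi)
    return float(freqs[np.argmin(np.abs(freqs - f_nominal))])


if __name__ == "__main__":
    fs = 30.0
    t = np.arange(600) / fs
    rng = np.random.default_rng(0)
    x = np.cos(2 * np.pi * 10.05 * t + 0.4) + 0.01 * rng.standard_normal(t.size)
    print(esprit_enf(x, fs, f_nominal=10.0))
```

Selecting the estimate closest to the aliased base frequency discards any spurious component that arises when more eigenvectors are retained than the signal subspace contains.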

The proposed approach combines the generation of the mean intensity time-series 
x(t) with either the ESPRIT or the STFT method. An outline of the 
proposed approach is given in Algorithm 1. 

In Section 4.4.9, Fisher's transformation is employed to assess whether the 
pairwise differences between the MCC delivered by the proposed approach and 
that of the state-of-the-art one are statistically significant at significance level 5%. 

Algorithm 1: Proposed SLIC-based approach for ENF estimation in 
video recordings 
Inputs: Number of video frames A, number of superpixels N, threshold $\tau$, 
cut-off frequencies, segment length L, number of overlapping segments V, 
ESPRIT parameters m and W, and reference ground truth. 
Output: Estimated ENF vector f. 

1: Perform SLIC in the first frame of the video recording to generate N regions 
of similar characteristics and luminance, i.e., superpixels. 

2: Compute the mean intensity value $\zeta_n(1)$ of each generated region. 

3: Retain only the regions whose mean intensity value exceeds threshold $\tau$ in the 
computation of x(1). 

4: Locate the generated regions in the A - 1 remaining frames and repeat steps 
2-3 to compute x(t), t = 2, 3, ..., A. 

5: Having computed the 1D time-series x(t), x(t) is bandpass filtered using the 
cut-off frequencies described in Sec. 4.2.1. 

6: The filtered signal is split into V overlapping segments. Each segment is 
obtained by multiplying the filtered signal with an L-size rectangular window. 
Any segment is shifted from its immediate predecessor segment by S s. 

7: In each segment, the prevalent frequency derived by the ESPRIT method is 
employed as the ENF estimate. In case of STFT, the frequency that 
corresponds to the maximum periodogram value is extracted as the ENF 
estimate. 

8: Compute the MCC between the estimated ENF and the reference ground 
truth. 


4.4 Experimental evaluation and statistical tests 

Table 4.2: Types of six video recordings employed for ENF estimation. 

Video name   Video type
mov$_1$      static
mov$_2$      static
mov$_3$      non-static
mov$_4$      non-static
mov$_5$      non-static
mov$_6$      non-static

The accuracy of ENF signal estimation is greatly influenced by the nature of video 
recordings. In static videos, the presence of ENF remains unaffected, resulting 
in higher estimation accuracy compared to non-static videos. In the latter case, 
continuous motion poses a challenge and adversely affects the accuracy of ENF 
estimation. To address this difficulty, numerous approaches have been developed 
with the goal of improving ENF estimation accuracy in non-static video scenarios. 
These approaches aim to mitigate the impact of motion and enable more robust 
and accurate estimation of the ENF signal. For this reason, the state-of-the-art 
approach for ENF estimation in videos [83], which employs intensity averaging 
with the MUSIC method, examines whether the video to be analyzed is a static or 
a non-static one. For brevity, from now on the state-of-the-art approach 
for both static and non-static videos will be referred to as MUSIC. The proposed 
approach employs either ESPRIT or STFT after SLIC. The novelty of the proposed 
approach lies in the fact that CCD sensors capture a time snapshot using a global 
shutter mechanism, which makes the distinction between static and non-static 
video obsolete. Thus, the proposed approach is applied regardless of whether the 
video recording is a static or a non-static one. It is tested on six video recordings of 
escalating difficulty from the publicly available dataset [97]. These recordings are 
either static or non-static ones. A reference ground truth signal is also available. 
The results are compared to those obtained by MUSIC [83]. The video recordings 
of the dataset employed in this Chapter are publicly available at 
https://tinyurl.com/34dmy8mh. 


4.4.1 Dataset description 


Six different video recordings were recorded in Vigo, Spain at a nominal ENF of 50 
Hz. Two different cameras were employed, namely, a GOPRO Hero 4 Black and 
an NK AC3061-4KN without an anti-flicker filter [97]. The video recordings are 
named as mov$_i$, $i = 1, 2, \ldots, 6$, and their types are listed in Table 4.2. 
Recording mov$_1$ is closer to what is known as "white wall" video in the literature. 
Going a step further, it depicts a flat coloured wall of low brightness. This 
kind of recording can be exploited to evaluate whether ENF variations can be 
embedded and, subsequently, estimated in such a static and seemingly noise-free 
environment. mov$_2$ is also a static video, which contains regions with different 
textures, brightness, and shadows. This video is more challenging than mov$_1$. 
mov$_3$ can be categorized as a non-static video. It starts showing a white wall and a 
wooden table. Then, an object is placed on the table and a human hand rapidly 
shakes white papers at regular intervals on the right region of the recording. mov$_4$ 
is a non-static video, where human movement appears. It is a complex recording 
and consists of several textures. It takes place within an office, where a human is 
constantly moving. Both the background wall and the floor are captured. mov$_5$ 
constitutes one of the most challenging recordings, which resembles a real-life scene 
captured by a security camera. It is recorded within the complex environment of 
a room. The scene contains several objects with different colors and textures. The 
most significant challenge of mov$_5$ is that the movement affects the majority of the 
frames and more than 50% of the pixels of each frame. mov$_6$ represents another 
challenging video recording, which contains a constant movement of a person inside 
a room. The movement takes place close to the camera, affecting most pixels 
in each frame. In all cases, the camera is fixed. Sample frames of the video recordings 
are depicted in Fig. 4.1. The estimated ENF signal is compared against a 
reference ground truth obtained from power mains. 

Figure 4.1: Sample frames of the six video recordings employed. (a) On the top 
left, there is a snapshot of a static video, recording a dark wall, while (b) on the 
top middle there is a snapshot of a static video, which captures the interior of a 
room. (c) On the top right, a table is depicted on which an object is placed. (d) 
On the bottom left, a person is constantly moving in an office. (e) On the bottom 
middle, there is a room with different textures and a person is moving covering 
many times a large part of the camera field. (f) On the bottom right, a person is 
moving in front of the camera lens inside a room. 
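The comparison of the estimated ENF against the reference ground truth through the MCC can be sketched as follows. The sketch assumes that the MCC is the largest Pearson correlation coefficient obtained when the estimated sequence is slid along a longer reference sequence; this definition, the function name `max_correlation_coefficient`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def max_correlation_coefficient(estimate, reference):
    """Slide `estimate` along the longer `reference` and return the largest
    Pearson correlation coefficient together with the lag where it occurs."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n = len(estimate)
    if n < 2 or n > len(reference):
        raise ValueError("estimate must have at least 2 samples and fit in reference")
    best_c, best_lag = -np.inf, 0
    for lag in range(len(reference) - n + 1):
        c = np.corrcoef(estimate, reference[lag:lag + n])[0, 1]
        if c > best_c:  # a NaN from a constant window never wins the comparison
            best_c, best_lag = c, lag
    return best_c, best_lag


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic reference ENF: slow random fluctuations around 50 Hz.
    reference = 50 + 0.01 * np.cumsum(rng.standard_normal(300))
    estimate = reference[40:140] + 0.001 * rng.standard_normal(100)
    print(max_correlation_coefficient(estimate, reference))
```

The lag at which the maximum occurs indicates where the estimated ENF aligns with the reference, which is the basis for revealing the time a recording was made.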


4.4.2 Results 


The approach detailed in Sec. 4.3 was applied to the six video recordings and 
the estimated ENF was compared against that of the MUSIC for static and non-static 
videos. Particularly, for static videos the state-of-the-art approach suggests 
averaging intensity values in each frame, while for non-static videos intensity values 
are averaged within relatively static regions of each frame. In all comparisons, a 
rectangular temporal window was employed. The predefined threshold $\tau$ was set at 
MV/3, where MV is the median of the N average intensity values within the generated 
regions in each frame. All approaches were implemented in MATLAB 2016a. A 
64-bit operating system with an Intel(R) Core(TM) i7-5930K CPU at 3.5 GHz 
was used in the experiments conducted. 




4.4.3 ENF estimation in static video mov$_1$ 


The ESPRIT method was tested for ENF estimation in mov$_1$. The static nature 
of mov$_1$ enables an accurate ENF estimation. The proposed approach, which employs 
the SLIC-based segmentation and intensity averaging, resulted in an MCC 
of 0.9926, outperforming the MUSIC [83], where the MCC was measured to be 
0.9658. When STFT was employed, the MCC was found to be 0.8662. Different 
segment durations in ENF estimation affect the results obtained. The MCC 
was computed for various segment durations D, as depicted in Fig. 4.2. When a 
segment duration of 1 s was employed, the proposed approach using the ESPRIT 
worked satisfactorily, yielding an MCC of about 0.79, while the MCC was measured 
to be about 0.5 when the MUSIC [83] was used. The performance of ENF 
estimation also depends on the filter order $\nu$ of the bandpass filter. The MCC 
is plotted versus various filter orders in Fig. 4.3. The top performance of the 
proposed approach, employing the ESPRIT, is achieved when $\nu = 111$. Although 
mov$_1$ is a trivial recording, the proposed approach offers significant improvements 
in ENF estimation accuracy against the method in [83]. The computational time 
of the proposed approach employing SLIC+ESPRIT was about 506.8 s, while the 
MUSIC required about 492.5 s. 


Figure 4.2: Maximum correlation coefficient of the proposed approach employing 
SLIC+ESPRIT for various segment durations against the MUSIC (mov1). 

Figure 4.3: Maximum correlation coefficient of the proposed approach employing 
SLIC+ESPRIT for various filter orders and segment durations (mov1). 


4.4.4 ENF estimation in static video mov$_2$ 


The static recording mov$_2$ is more challenging than mov$_1$ due to different textures 
and various levels of luminance. The STFT was employed for ENF estimation, 
yielding an MCC of 0.9704. The MUSIC resulted in an MCC of 0.9466. The 
ESPRIT method achieved an MCC of 0.9526. In this case, there is a strong correlation 
between the proposed approach and the method in [83] w.r.t. segment 
duration. Smaller segment durations resulted in lower MCCs in both approaches. 
For longer segment durations, both approaches yielded a higher MCC, as shown 
in Fig. 4.4. Similar behaviour was noticed when different filter orders were employed. 
The top performance was observed when the bandpass filter order $\nu = 81$ 
was used. The MCC of the proposed approach employing SLIC+STFT for 
various values of bandpass filter order and segment duration is plotted in Fig. 
4.5. The proposed approach employing SLIC+STFT required about 627.2 s. The 
computational time of the MUSIC one was approximately 704.7 s. 

Figure 4.4: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various segment durations against the MUSIC (mov2). 

Figure 4.5: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various bandpass filter orders and segment durations (mov2). 


4.4.5 ENF estimation in non-static video mov$_3$ 


The STFT method was employed for ENF estimation. mov$_3$ is a challenging video 
depicting movements and different textures. Thus, ENF estimation is a non-trivial 
task. The STFT achieved an MCC of 0.9877, outperforming the method in [83], 
which reached an MCC of 0.9191. The ESPRIT method resulted in an MCC of 
0.7271. As can be seen in Fig. 4.6, the longer the segment duration, the more accurate 
the ENF estimation. The top result w.r.t. the MCC was measured for bandpass filter 
order $\nu = 51$. In mov$_3$, improper values of filter order can lead to a significant 
reduction in MCC. Increasing the segment duration usually results in a more 
accurate ENF estimation w.r.t. the MCC. In this experiment, it has been noticed 
that when a large value of bandpass filter order is employed, increasing segment 
duration deteriorates estimation accuracy. The impact of filter order on MCC 
is demonstrated in Fig. 4.7. The computational time of the proposed approach 
employing SLIC+STFT was about 468.5 s, while the MUSIC required 531.4 s. 

Figure 4.6: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various segment durations against the MUSIC (mov3). 

Figure 4.7: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various bandpass filter orders and segment durations (mov3). 


4.4.6 ENF estimation in non-static video mov$_4$ 


The non-static video mov$_4$ captures a much more complex scene, where the human 
presence and movement is closer to real-life applications than the previous 
videos. Here, the STFT was employed for ENF estimation. The STFT yielded 
an MCC of 0.9837, outperforming the MUSIC, which attained 0.8700 [83]. 
When the ESPRIT method was used, an MCC of 0.7605 was measured. The top 
performance was achieved for $\nu = 51$. The MCC of the proposed approach employing 
SLIC+STFT for various segment durations is shown in Fig. 4.8. MCC 
values of different segment durations and various bandpass filter orders are plotted 
in Fig. 4.9. The computational time required by the proposed method employing 
SLIC+STFT was about 423.3 s, while the execution of the MUSIC [83] required 
487.2 s. 

Figure 4.8: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various segment durations against the MUSIC (mov4). 

Figure 4.9: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various bandpass filter orders and segment durations (mov4). 


4.4.7 ENF estimation in non-static video mov$_5$ 


Video mov$_5$ is one of the most challenging recordings. It resembles a scene captured 
by a security camera. Here, the STFT was employed for ENF estimation. The 
STFT achieved an MCC of 0.9432, outperforming the MUSIC, whose MCC was 
measured to be 0.8441 [83]. When the ESPRIT was employed, the MCC reached 
0.8959. The MCC of STFT is plotted for various segment durations against the 
MUSIC in Fig. 4.10. When different values of bandpass filter order were 
employed, a longer segment duration was found to yield an increase in MCC, as 
can be seen in Fig. 4.11. However, for segment durations of 40 s or longer, 
a plateau is noticed. The top MCC was achieved for a bandpass filter 
order of $\nu = 511$. The execution of the proposed approach employing SLIC+STFT 
required 523.4 s, while the computational time of the MUSIC was 
about 602.6 s. 


Figure 4.10: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various segment durations against the MUSIC (mov5). 

Figure 4.11: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various bandpass filter orders and segment durations (mov5). 


4.4.8 ENF estimation in non-static video mov$_6$ 


Similarly to video mov$_5$, mov$_6$ constitutes a challenging real-world indoor recording. 
This recording resembles a scene captured by a hidden camera under special 
conditions, which could hinder ENF estimation accuracy. Nevertheless, the proposed 
approach employing STFT resulted in an MCC of 0.9309, outperforming the 
MUSIC, whose MCC was measured to be 0.9115. The MCC of SLIC+STFT 
is plotted for various segment durations against the MUSIC in Fig. 4.12. The 
proposed approach performs better than the MUSIC for a segment duration 
of about 85 s. For shorter segment durations, the MUSIC [83] demonstrates a stable 
performance, outperforming the proposed SLIC+STFT. For different values of 
bandpass filter order, it is worth mentioning that by increasing segment duration, 
an increase in MCC is observed for all cases, as can be seen in Fig. 4.13. The top 
MCC was achieved for a bandpass filter order of $\nu = 111$. The execution of the 
proposed approach required 572.5 s, while the execution of the MUSIC [83] method 
required 639.5 s. 

Figure 4.12: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various segment durations against the MUSIC (mov6). 

Figure 4.13: Maximum correlation coefficient of the proposed approach employing 
SLIC+STFT for various bandpass filter orders and segment durations (mov6). 

Table 4.3: Maximum correlation coefficient of the proposed approach employing 
either STFT or ESPRIT and the MUSIC for all recordings. The filter order 
employed is also quoted. 

Video     MCC (here)   MCC (MUSIC)   Filter order $\nu$   ENF samples K
mov$_1$   0.9926       0.9658        111                  702
mov$_2$   0.9704       0.9466        81                   639
mov$_3$   0.9877       0.9191        51                   647
mov$_4$   0.9837       0.8700        51                   623
mov$_5$   0.9432       0.8441        511                  729
mov$_6$   0.9309       0.9115        111                  741


4.4.9 Statistical significance of MCC differences 


In order to assess whether the improvements in MCC of the proposed approach, 
employing SLIC and either STFT or ESPRIT, against the MUSIC [83] are statistically 
significant, hypothesis testing was applied to all six recordings. The null 
hypothesis, $H_0: c_1 = c_2$, indicates that MCCs are equal and the alternative one, 
$H_1: c_1 \neq c_2$, indicates the opposite. 
For each video recording, the MCCs of the proposed approach and the MUSIC 
[83] undergo Fisher's z transformation [66]: 

$$z = 0.5 \ln \frac{1 + c}{1 - c} \qquad (4.7)$$

The test statistic is given by: 

$$q_F = \sqrt{K - 3}\,(z_1 - z_2) \qquad (4.8)$$

where $z_1$ and $z_2$ are the transformed MCCs and K denotes the number of ENF samples. 
The test statistic $q_F$ is distributed as Gaussian with zero mean value and unit variance, for large K. 
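The test can be traced numerically with the short Python sketch below, which uses only the standard library. The function names are hypothetical; the example values are the MCCs and the number of ENF samples reported for the first recording, and 1.96 is the two-sided critical value of the standard Gaussian distribution at significance level 5%.

```python
import math


def fisher_z(c):
    """Fisher's z transformation of a correlation coefficient, as in (4.7)."""
    return 0.5 * math.log((1.0 + c) / (1.0 - c))


def mcc_difference_statistic(c1, c2, n_samples):
    """Test statistic of (4.8) for two MCC values computed from n_samples ENF samples."""
    return math.sqrt(n_samples - 3) * (fisher_z(c1) - fisher_z(c2))


if __name__ == "__main__":
    # First recording: MCC of the proposed approach, MCC of MUSIC, ENF samples.
    q = mcc_difference_statistic(0.9926, 0.9658, 702)
    print(f"q = {q:.2f}, significant at the 5% level: {abs(q) > 1.96}")
```

Because the transformation stretches correlations close to one, even a modest difference between two high MCCs can yield a large test statistic when several hundred ENF samples are available.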

It is checked whether the test statistic $q_F$ falls within the region of acceptance 
for significance level 5%. If it does so, the null hypothesis $H_0$ is accepted and, 
thus, the differences between the MCCs are not statistically significant. On the 
other hand, if $q_F$ falls outside the region of acceptance (i.e., $|q_F| > 1.96$), the 
alternative hypothesis $H_1$ is accepted, indicating that MCC differences are statistically 
significant. Statistical tests constitute an important contribution of this 
Chapter, offering a mechanism for making quantitative decisions, which can lead 
to accurate ENF estimation in practical forensic applications. The top MCC value 
of the proposed approach employing SLIC and either STFT or ESPRIT and that 
of the MUSIC for each recording, together with the filter order employed, are summarized 
in Table 4.3. 

In all cases in Table 4.3, $q_F$ was calculated and found to be outside the region of 
acceptance for significance level 5%. Consequently, there is sufficient evidence to 
warrant the rejection of the null hypothesis. Therefore, the differences between the 
MCCs are statistically significant and the proposed approach yields statistically 
significant improvements in ENF estimation accuracy against the MUSIC [83]. 

Chapter 5 


Adaptive hypergraph learning 
with multi-stage optimizations for 
image and tag recommendation 


5.1 Introduction 


In recent years, due to the growth spurt witnessed in social media content, there 
is an increasing need to deploy accurate algorithms, exploiting this volume of 
information into practical applications. For example, the exponential increase 
in social media usage has led users to upload a tremendous amount of images 
on the Web. An abundance of websites enabling image sharing (e.g., Flickr) 
has promoted user interaction. This kind of interaction has been found useful in 
shrinking the so-called semantic gap between the depicted visual content and its 
description in terms of labels or text annotations and amplifies the trend towards 
recommendation systems. 

Tourism recommendation systems (i.e., recommendation of worth visiting places 
of interest [POIs]) have been built upon structures, such as hypergraphs, taking 
advantage of the semantic image annotation provided by users. Websites, such 
as TripAdvisor and Booking, exhort people to share images, comments, and experiences 
from their vacations, excursions, or visits to local activities. 
POI recommendation systems try to leverage different types of context 
information aiming at improving users’ experience and systems’ accuracy [99]. 
Improvements in image and tag recommendation precision can enhance touristic 
experience. However, optimizing recommendation accuracy and system efficiency 
still remain open problems. 

Hypergraphs are widely used in applications, such as image recommendation, 
annotation, classification, retrieval, and computer vision. In [100], an inductive multi-label 
learning algorithm for 3D object classification was proposed. The objective 
was to learn the optimal projection of training data and optimize the weights of 
multi-hypergraph, simultaneously. A visual tracker employing hypergraph learning 
was proposed in [101], where high-order relations among different frames of a tar- 
get were modeled by hypergraphs. A framework for video object segmentation was 
introduced in [102], where complex spatio-temporal relationships were modeled by 
a hypergraph in order to segment the video into scene objects. There, hypergraphs 
were employed to model the complex relationships among images, where the prob- 
lem of object segmentation was formulated as a hypergraph partition problem. A 
noise-resilient elastic net hypergraph for spectral clustering and semi-supervised 
classification, was proposed in [103]. The lack of training samples hinders hyper- 
spectral image classification. A hyperspectral image classification method, which 
takes into account the spatial context of pixels was proposed in [104]. In par- 
ticular, a hypergraph structure was employed to model the relationship among 
image pixels and improve classification accuracy. Hyperedges among images were 
generated for a fluctuating area about each vertex and high-order relationships 
were modeled. An adaptive hypergraph learning method for image classification 
was proposed in [105]. In multi-label learning, the correlation among labels is 
critical in model construction. A canonical correlation analysis framework was 
proposed to address the multi-label classification problem [106]. Recently, graph 
convolutional neural networks were employed for visual classification tasks in order 
to model pairwise relations in visual data [107]. These relations are not considered 
when classic CNN-based methods are employed. Hypergraphs were used 
to capture high-order relationships in visual data. The high-order correlation was 
optimized and the classification task was performed by employing high-order data 
correlations. Hypergraph neural networks have become a suitable and efficient 
tool for data representation learning. High-order data correlation can be encoded 
making use of such networks. A general framework based on hypergraph neural 
networks for data representation was proposed in [108]. Hypergraph Laplacian 
and truncated Chebyshev polynomials were employed to conduct convolution in 
the spectral domain. Applications of hypergraphs to image recommendation and 
annotation are surveyed in Sec. 

This Chapter elaborates hypergraph multi-stage optimization learning for im- 
age and tag recommendation. It aims at jointly optimizing hypergraph ranking, 
hypergraph structure updating, and hyperedge weight adaptation. These problems 
were separately addressed previously, which motivated the development of a unified 
framework. The scheme proposed herein focuses on hypergraph optimizations, 
which allow accurate modeling of complex high-order data relations, enhancing 
their potential to lead to more accurate recommendations and decisions. 

Hyperedge weight adaptation in hypergraph learning was investigated in [109], 
[110], [111]. Weight adaptation employing the Armijo rule and gradient descent 
was also proposed [112]. A hyperedge weight assigning method was introduced in 
[113]. The method depends on the similarities between edges within the same subgraph. 
An efficient method of hypergraph ranking optimization based on block 
randomized singular value decomposition was discussed in [114]. The method 
reduces computational time by exploiting the sparse and low-rank nature of the 
normalized Laplacian. 

Recently, attention has been paid to structure adaptation in addition to weight
adaptation. Structure adaptation deals with the adaptation of the incidence matrix.
Since the hypergraph structure remains unaltered during learning in [109], [110],
[111], [112], [113], [114], a well-designed adaptive hypergraph structure is
expected to provide a more accurate representation of high-order data relations.
Otherwise, the learning process propagates the errors generated during hypergraph
construction and the ranking results deteriorate. A dynamic hypergraph
structure learning method, called Dynamic Hypergraph Structure Learning
(DHSL), was proposed in [115]. In this Chapter, we propose a novel hypergraph 
multi-stage optimization (HMSO) learning scheme, expanding the weight adapta- 
tion in [109]. The proposed scheme jointly optimizes hyperedge weights and hyper- 
graph structure by updating the incidence matrix, as well as ranking for delivering 
accurate recommendations. In HMSO, the constructed hypergraph captures semantic
information among vertices, such as the tags associated with each image,
geographical information (i.e., the location of what is depicted in the image), and
visual content by extracting features from the image. These three pillars are
modeled by a hypergraph, which drives the proposed scheme. HMSO is applied to an
expanded version of the dataset employed in for image recommendation
associated with POIs. Moreover, the NUS-WIDE-LITE dataset [118], which
was also used in [119], [120], [121], is employed herein to demonstrate the
effectiveness of the proposed scheme for tag recommendation. Superior performance is
achieved compared to that of state-of-the-art methods. Existing methods employ
a static incidence matrix, which suffers from errors, restricting the effectiveness of
hypergraph frameworks. In this Chapter, we also propose a novel approach for
hypergraph structure learning, which aims at handling the errors that may
exist in the initial construction of the incidence matrix. The proposed multi-stage
optimization scheme is solved from first principles. In particular, the hyperedge
weight optimization is derived analytically, yielding novel generalized
expressions. Moreover, in this context, a Least Mean Square (LMS) approach for
hyperedge weight adaptation is proposed.

The contributions of this Chapter are summarized as follows: 


• A novel hypergraph multi-stage optimization learning scheme is presented
for image and tag recommendation. Hypergraph ranking, structure updating,
and hyperedge weight adaptation are jointly optimized, resulting in
accurate ranking vectors for image and tag recommendation. To the best of
the authors' knowledge, this is the first attempt to conduct hypergraph learning,
hypergraph structure updating, and adaptive hyperedge weight estimation
within a multi-stage optimization scheme. Moreover, a systematic study was
conducted regarding parameter selection in structure optimization, enabling
more accurate modeling of high-order data relations.


• From a theoretical perspective, both the structure and the weight optimizations
that take place in the proposed learning scheme are solved analytically from first
principles and full derivations are provided. Links with related methods
are established, correcting mistakes found there.


• A complementary system for semantic annotation is developed, which exploits
image features derived by a CNN, as described in Sec. The features
extracted by the CNN replace the GIST descriptors used in [110], [117].


• A new dataset regarding touristic applications is provided to the community
for the evaluation of image and tag recommendation systems.


• An LMS approach for hyperedge weight adaptation is derived and tested
against the conventional closed-form expression for hyperedge weight
adaptation on a subset of the introduced image recommendation dataset.


5.2 Related work 


5.2.1 Related work on recommendation using hypergraphs. 


POI recommendation is addressed as an image ranking problem. Accordingly, POI
recommendation employing geographical factorization models (e.g., [99]) falls
outside the scope of this Chapter. Online images are usually accompanied by
metadata offering additional information. Hypergraphs are able to model these
metadata. A hypergraph combining various types of information was employed
in an image ranking problem [123]. An attribute-boosted hypergraph approach was
proposed in [124], where middle-level visual attributes were employed in order to
enhance the precision of image ranking. By exploiting those attributes, a better
understanding of image correlations was achieved, which led to more accurate
ranking results. A probabilistic hypergraph was employed to indicate the
relationship between images. It was integrated within a transductive learning framework
for content-based image retrieval, as proposed in [125]. Image search was
formulated as a hypergraph ranking problem, where vertices represented images in
a weighted hypergraph. A soft hypergraph was proposed to exploit the correlation
among images, where image labels were ranked by a transductive learning
approach [126]. A hypergraph-based manifold ranking method for unsupervised
multimedia retrieval was proposed in [127]. Local and global image features were
utilized in hypergraph learning. A pseudo-relevance mechanism was integrated
within the hypergraph learning method for tag-based image retrieval in [128]. A
two-step tag recommendation approach was proposed in [129]. During the first
step, the relation among users, resources, and their corresponding tags was
modeled by a hypergraph. The second step derived candidate tags for the given inputs.
In [130], a technique was presented for improving existing tag recommendation
methods, which were based on graphs. There, a new model of the folksonomy was
introduced as a directed graph.

Multimedia content and context were jointly exploited for tagging, retrieval,
and recommendation [131]. A graph-based reinforcement algorithm for interrelated
multi-type objects was proposed in [132], where social image tagging was
treated as a “ranking and reinforcement” problem. Internal relations of social
networks were analyzed by employing a hypergraph topology in [133], and the cold-start
problem in recommender systems was tackled. Spectral hashing was extended to
hypergraphs for faster similarity search of images in social networks. A hashing
framework for image search on social media was introduced in [134], employing
heterogeneous hyperedges enriched by visual features. Social media image search
was conducted by jointly employing tags and visual features in [135]. Human-based
image annotation suffers from noise. Low-level visual features were utilized
to overcome this kind of error. A framework that employed a Markov random
walk model with a parameter to balance data fusion between image tags and
content was presented in [136]. An adaptive hypergraph learning method for
semi-supervised image annotation was proposed in [137]. Automatic image annotation
was addressed in [138] by capturing high-order similarities in feature space,
thereby solving the class imbalance problem in the data.

A tourism recommendation system was presented in [139], where large-scale 
geo-tagged photo collections were clustered according to their location and repre- 
sentative tags were assigned to each cluster. Probabilistic Latent Semantic Anal- 
ysis (PLSA) was used for image annotation and POI recommendation in [116]. 
Geo-tagged images were given as input and they were clustered by location.
Afterwards, semantic annotation as well as visual image classification were applied. A
tourism recommendation system based on hypergraph ranking with sparsity
constraints was proposed in [117].

A collaborative filtering approach for POI recommendation was proposed in
employing hypergraph models to capture heterogeneous high-order information
regarding user preferences and the multimodal content of POIs.
Figure 5.1: Modules of the proposed image and tag recommendation system. The
proposed multi-stage approach is shown in the Hypergraph Learning module, where
the multi-stage optimization problem is solved in an alternating optimization fashion
for all queries (i.e., POIs). Search results depict the output of the system for
a single query.


5.3 Recommendation system 


5.3.1 System overview 


The developed image and tag recommendation system consists of three modules
[141]. The modules are shown in Fig. The first module performs clustering, as
in [117]. Images are clustered according to their geo-tags into $n_g$ geo-clusters,
which represent POIs. The geo-clusters are sorted according to the number of
images they contain. A larger number of images inside a geo-cluster indicates that
more visitors have uploaded images of that place, i.e., that the place is attractive
to tourists. Afterwards, a document is generated for each geo-cluster, containing
the text information (e.g., tags, title) associated with each image within the
geo-cluster. To perform semantic annotation, a term-document matrix is created and a
probabilistic mixture decomposition is derived using PLSA. PLSA reveals the relations
between the terms and documents, which are captured by the probability distribution
of the documents and the terms. Each term in a document is modeled as a sample from
a mixture model. Multinomial random variables, which constitute the mixture components,
can be treated as topic representations [142]. The conditional distribution of the
terms given the documents can be decomposed in terms of the conditional distribution
of terms given the topics and the conditional distribution of topics given
the documents. An online PLSA implementation was proposed in [143].

The data generation process is described as follows [142]: 1) select document
$d$ with probability $P(d)$; 2) select latent topic $z_a$ with probability $P(z_a|d)$; and
3) generate term $t_a$ with probability $P(t_a|z_a)$. Let
$t_a \in T_a = \{t_{a_1}, t_{a_2}, \ldots, t_{a_{|T_a|}}\}$ be a vocabulary term and
$d \in D = \{d_1, d_2, \ldots, d_m\}$ denote a document. The joint
probability model is defined by the mixture:

    $P(t_a, d) = P(d) \underbrace{\sum_{z_a \in Z_a} P(t_a|z_a)\, P(z_a|d)}_{P(t_a|d)}$    (5.1)

where $z_a \in Z_a = \{z_{a_1}, z_{a_2}, \ldots, z_{a_{|Z_a|}}\}$ is an unobserved class variable
representing a topic. After PLSA has been applied to the term-document matrix, the topic
$z_a^*$ with the highest probability is chosen. Next, the 30 terms most related to
topic $z_a^*$ are selected by sorting the conditional probability $P(t_a|z_a^*)$
in decreasing order. The number of most related terms was determined based on
[116]. A geographical dictionary¹ is used in order to indicate whether geographical
information is provided within the most related terms.
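
The topic and term selection just described can be sketched as follows. This is only a minimal illustration, assuming that PLSA has already produced the conditional distributions; the function name `top_topic_and_terms` and the array layout (terms in rows, topics in columns) are hypothetical choices and not part of the original system.

```python
import numpy as np

def top_topic_and_terms(p_z_given_d, p_t_given_z, n_terms=30):
    """Pick the most probable topic of one document and its most related terms.

    p_z_given_d : shape (n_topics,), P(z_a | d) for one document (assumed given)
    p_t_given_z : shape (n_vocab, n_topics), column k holds P(t_a | z_a = k)
    """
    p_z_given_d = np.asarray(p_z_given_d, dtype=float)
    p_t_given_z = np.asarray(p_t_given_z, dtype=float)
    # Topic z_a^* with the highest probability for this document.
    z_star = int(np.argmax(p_z_given_d))
    # Sort terms by P(t_a | z_a^*) in decreasing order and keep the first n_terms.
    order = np.argsort(-p_t_given_z[:, z_star], kind="stable")
    return z_star, order[:n_terms].tolist()

# Toy example with 3 topics and a 4-term vocabulary (illustrative numbers only).
p_z = [0.2, 0.7, 0.1]
p_t = np.array([[0.5, 0.1, 0.25],
                [0.2, 0.4, 0.25],
                [0.2, 0.3, 0.25],
                [0.1, 0.2, 0.25]])
z_star, terms = top_topic_and_terms(p_z, p_t, n_terms=2)
```

In the setting of this Chapter, `n_terms` would be 30 and the returned indices would mark the rows of the term block of the incidence matrix.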

The second module, visual annotation, complements semantic annotation.
The deep CNN VGG16 [144] is employed to extract features from every
image. The network consists of 16 weight layers, as shown in Fig. All images
are scaled to 224 × 224 pixels, which is the input size expected by VGG16. The
mean RGB value across all images is subtracted from each image. Afterwards, a
forward pass is performed for each image. The output of the last layer prior
to the classification layer, FC2, is used as the feature vector of the corresponding
image. Feature vectors for all images are extracted and the mean feature vector of all
images within each geo-cluster is calculated as the representative feature vector.
Then, the 10 nearest neighbors (NNs) are determined and inserted in the
hypergraph, as described in Sec. The number of NNs employed herein has
been chosen empirically and can be adjusted depending on the dataset.

The third module is the brain of the image and tag recommendation system. 
It exploits the multi-stage optimizations of hyperedge weight update, topology 
learning, and hypergraph ranking in order to make the best recommendations 
to the user w.r.t. the geo-tagged image given as input. Hypergraph vertices 
represent the geo-cluster documents, the topics, and the terms derived by PLSA. 
Hypergraph construction and optimizations are detailed in Sec. and Sec. 
respectively. The novelty of this Chapter is in the development of the third 
module and in particular the solution of optimization problems detailed in Sec. 


5.3.2. Dataset of Greek POIs 


Here, an expanded version of the dataset used in [110], [116] is employed.
It consists of 99,777 images collected from Flickr, depicting Greek sights, which 


¹ https://www.geonames.org/
Figure 5.2: Architecture of the deep convolutional network VGG16. The parameters of
the convolutional layers are denoted as conv, followed by two numbers denoting
the size of the receptive field and the number of channels. In this work, we take
advantage of the FC2 layer and employ its output as the feature vector of each image.


could be considered as POIs. Each image is accompanied by auxiliary information,
such as title, tags, and geo-tags (latitude, longitude). Distances
among images were calculated using the Haversine formula and the images were
clustered into 12,779 clusters by means of hierarchical clustering. The most
worth-visiting touristic POIs were found to be 5,000 clusters, which contain the majority
of the images, i.e., 86,182 images (or 86.38% of the total). A document was generated for
each geo-cluster, comprising the titles and tags of the images gathered
in that geo-cluster. The dataset employed in this Chapter is publicly available.
A sample of the dataset is depicted in Fig.
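
As an illustration of the distance computation used for geo-clustering, the Haversine formula can be sketched as below. This is a minimal sketch under stated assumptions: the Earth radius value and the function names `haversine_km` and `pairwise_distances_km` are choices made here for illustration, and the hierarchical clustering step itself is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not specified in the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def pairwise_distances_km(points):
    """Symmetric distance matrix (list of lists) for a list of (lat, lon) geo-tags."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = haversine_km(points[i][0], points[i][1], points[j][0], points[j][1])
            dist[i][j] = dist[j][i] = d
    return dist

# Illustrative geo-tags (approximate coordinates of Athens and Heraklion).
example = pairwise_distances_km([(37.98, 23.73), (35.34, 25.13)])
```

Such a distance matrix could then be passed to any hierarchical clustering routine; the linkage criterion and cut-off used for the dataset are not stated in this section.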

A vocabulary representing the context information of tourism applications was
then created. To do so, the textual information associated with
a set of 150,000 images was crawled. All letters were converted to lower case and
symbols were discarded. Finally, a vocabulary of 1,901 terms was derived, where
each term appeared with a frequency greater than 100.


5.3.3. NUS-WIDE-LITE dataset 


In order to further evaluate the proposed method, experiments were also conducted
on the NUS-WIDE-LITE dataset. It contains Flickr images accompanied by their
tags. The NUS-WIDE-LITE dataset consists of 55,615 images. Only images having
at least 18 tags were retained herein, resulting in 4,457


NUS-WIDE. html 
Figure 5.3: Sample images of POIs on the island of Crete, which are included in the
Greek POIs dataset.


images. The total number of tags was 105,377, of which 1,000 were unique.
90% of the tags, i.e., 94,839, were employed for training, while 10%, i.e., 10,537,
were employed for testing.


5.3.4 Hypergraph construction 


Having performed visual and semantic annotation, the next step is hypergraph
construction. Let $G(V, E, w)$ denote the hypergraph, where $V$ is the set of vertices,
$E$ is the set of hyperedges, and $w(e)$ is a real-valued function assigning to hyperedge
$e \in E$ a weight that indicates the relative importance of the high-order relationship
captured by the hyperedge. Geo-tags, image topics, and terms constitute the
vertex set $V$. Let $\kappa = |V|$ and $\lambda = |E|$. The incidence matrix
$H \in \mathbb{R}^{\kappa \times \lambda}$ has elements $H(v, e) = 1$ if $v \in e$ and 0 otherwise.
After the initial construction of a binary incidence matrix $H$, an update procedure
follows, which projects the elements of $H$ onto a feasible set, as described in Sec. The
elements appearing in the main diagonal of the vertex degree diagonal matrix
$D_v \in \mathbb{R}^{\kappa \times \kappa}$ are defined as

    $\delta(v) = \sum_{e \in E} w(e)\, H(v, e)$    (5.2)

where $w(e)$ is the weight function. The elements appearing in the main diagonal
of the hyperedge degree diagonal matrix $D_e \in \mathbb{R}^{\lambda \times \lambda}$ are defined as

    $\delta(e) = \sum_{v \in V} H(v, e).$    (5.3)
Let diag(·) denote the diagonal matrix with elements in the main diagonal specified
by the vector argument. The vertex degree and the hyperedge degree matrices are
defined in matrix form as $D_v = \mathrm{diag}(HW1) = \mathrm{diag}(H\,\mathrm{diag}(w)\,1)$ and
$D_e = \mathrm{diag}(1^\top H)$, respectively. The incidence matrix $H$ for the Greek POIs dataset
is of size 7401 × 15000. Its rows correspond to $|D| = 5000$ documents associated with
geo-clusters, $|Z_a| = 500$ topics, and $|T_a| = 1901$ vocabulary terms. The detailed
structure of the incidence matrix is shown in Table
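
The degree definitions (5.2) and (5.3) translate directly into a few array operations. The sketch below is an illustration only; the function name `degree_matrices` and the toy incidence matrix are assumptions introduced here.

```python
import numpy as np

def degree_matrices(H, w):
    """Return the diagonal vertex and hyperedge degree matrices D_v and D_e.

    H : (kappa, lam) incidence matrix, w : (lam,) hyperedge weights.
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    delta_v = H @ w           # (5.2): weighted sum over the hyperedges of each vertex
    delta_e = H.sum(axis=0)   # (5.3): number of vertices in each hyperedge
    return np.diag(delta_v), np.diag(delta_e)

# Toy hypergraph with 4 vertices and 3 hyperedges (illustrative values).
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
w = np.array([0.5, 1.0, 2.0])
D_v, D_e = degree_matrices(H, w)
```

For the 7401 × 15000 matrix described above, storing only the degree vectors (or using sparse matrices) would be preferable to dense diagonal matrices; the dense form is used here purely for readability.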

The following procedure describes the construction of the incidence matrix for 
the jth geo-cluster. The procedure is repeated for all geo-clusters. 

A hyperedge $e_1$ is created that captures three associations. $De_1$ is defined
by inserting a 1 in the row of the incidence matrix that is associated with the
document $d_j$. $De_1$ is a $|D| \times |D|$ identity matrix for $|D| = 5000$. Moreover, $Z_a e_1$
is created by inserting 1 in the row associated with the top-ranked topic $z_a^*$ and 0
in the rows associated with other topics. $Z_a e_1$ is a $|Z_a| \times |D|$ matrix for $|Z_a| = 500$
and $|D| = 5000$. Each column of $Z_a e_1$ contains a single one, indicating the latent
topic associated with a given geo-cluster, as can be seen in Fig. 5.4(a). The 30 most
related terms linked to the top-ranked topic are marked by 1 in the association
$T_a e_1$. $T_a e_1$ is a $|T_a| \times |D|$ matrix for $|T_a| = 1901$ and $|D| = 5000$. Each column of
$T_a e_1$ contains 30 ones, indicating the terms associated with each topic, as can be seen
in Fig. 5.4(b). The initial weight for the hyperedge $e_1$ is given by $w(e_1) = P(z_a^*|d_j)$.

Geographical coordinates constitute a cornerstone in any image and tag
recommendation system for touristic applications. Thus, a hyperedge $e_2$ is inserted
in order to capture the geographical relatedness among geo-clusters. Distances
among geo-clusters are calculated and 1 is assigned to the geo-cluster represented by
$d_i$ if the distance between $d_j$ and $d_i$ is less than 150 km, and 0 otherwise, thus
defining the associations $De_2$, as can be seen in Fig. 5.4(c). $De_2$ is a $|D| \times |D|$ matrix,
which contains ones wherever the distance between two geo-clusters is less than 150
km. The hyperedge weight $w(e_2)$ is set to 1.

Apart from the similarity induced by the geo-tags of the geo-cluster represented
by $d_j$, visual similarity is also taken into account via a hyperedge $e_3$. The mean of
the CNN feature vectors of all images in the $j$th geo-cluster is derived as the codevector
of the associated geo-cluster. Next, the $K$ nearest neighbors of each codevector
with $K = 10$ are extracted and 1 is assigned to those geo-clusters whose codevectors
are included in the 10-NN. $De_3$ is a $|D| \times |D|$ matrix whose columns indicate the
nearest geo-clusters to a given geo-cluster w.r.t. the similarity of the representative
visual feature vectors of geo-clusters. Each column of $De_3$ contains $K$ ones for the
nearest geo-clusters to a given geo-cluster, as can be seen in Fig. 5.4(d). The
hyperedge weight $w(e_3)$ is set to 1.
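
The construction of the geographical and visual blocks, and their assembly with the semantic blocks, can be sketched as follows. This is a hedged illustration rather than the original implementation: the Euclidean metric for the codevectors, the exclusion of a geo-cluster from its own neighbor list, and the function names `geo_block`, `visual_block`, and `assemble_incidence` are assumptions made here.

```python
import numpy as np

def geo_block(dist_km, radius_km=150.0):
    """De_2 sketch: entry (i, j) is 1 when geo-clusters i and j are closer than radius_km."""
    return (np.asarray(dist_km, dtype=float) < radius_km).astype(float)

def visual_block(codevectors, k=10):
    """De_3 sketch: column j marks the k nearest codevectors of geo-cluster j."""
    X = np.asarray(codevectors, dtype=float)
    n = X.shape[0]
    # Squared Euclidean distances (assumed metric; not specified in the text).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq, np.inf)  # assumption: a geo-cluster is not its own neighbor
    block = np.zeros((n, n))
    for j in range(n):
        block[np.argsort(sq[:, j])[:k], j] = 1.0
    return block

def assemble_incidence(De1, Zae1, Tae1, De2, De3):
    """Stack the blocks following the layout of Table 5.1."""
    zeros_z = np.zeros((Zae1.shape[0], De2.shape[1]))
    zeros_t = np.zeros((Tae1.shape[0], De2.shape[1]))
    return np.block([[De1, De2, De3],
                     [Zae1, zeros_z, zeros_z],
                     [Tae1, zeros_t, zeros_t]])

# Toy example: 3 geo-clusters, 2 topics, 4 terms (illustrative values only).
dist = np.array([[0, 100, 200], [100, 0, 140], [200, 140, 0]], dtype=float)
De2 = geo_block(dist)
De3 = visual_block([[0.0], [1.0], [10.0]], k=1)
Zae1 = np.array([[1, 0, 1], [0, 1, 0]], dtype=float)
Tae1 = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
H_toy = assemble_incidence(np.eye(3), Zae1, Tae1, De2, De3)
```

With 5000 geo-clusters, 500 topics, and 1901 terms, the same layout yields the 7401 × 15000 matrix mentioned earlier.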
Table 5.1: Hypergraph incidence matrix H for the Greek POI dataset.


           $e_1$        $e_2$      $e_3$
$D$        $De_1$       $De_2$     $De_3$
$Z_a$      $Z_a e_1$    0          0
$T_a$      $T_a e_1$    0          0


Figure 5.4: Sample columns of submatrices of the incidence matrix for the Greek
POI dataset. (a) $Z_a e_1$, (b) $T_a e_1$, (c) $De_2$, and (d) $De_3$.


When the NUS-WIDE-LITE dataset is employed, hyperedges establish associations
among images sharing common tags. Accordingly, the incidence matrix $H$
of this hypergraph consists of two submatrices, as shown in Table The upper
submatrix is a 1000 × 1000 identity matrix connecting each tag with itself. The
lower one consists of as many column vectors $c_i$ as the number of unique tags (i.e.,
1000). The dimension of each vector $c_i$ is 4457 × 1. A hyperedge $e_i$ is defined by the
column vector $(I_i^\top \,|\, c_i^\top)^\top$, where $I_i$ denotes the $i$th column of the identity matrix.
The $j$th element of $c_i$ is 1 if the $j$th image contains the $i$th tag. Otherwise, it equals
0. The weight of $e_i$ is set to $w(e_i) = 1$, $i = 1, 2, \ldots, 1000$.
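
A minimal sketch of this two-block incidence matrix is given below. The input format (one collection of tag indices per image) and the function name `tag_image_incidence` are assumptions introduced for illustration.

```python
import numpy as np

def tag_image_incidence(image_tags, n_tags):
    """Stack an identity block (tags) on top of the image-tag block (columns c_i).

    image_tags : list with one iterable of tag indices per image (assumed format)
    n_tags     : number of unique tags, i.e., number of hyperedges
    """
    C = np.zeros((len(image_tags), n_tags))
    for j, tags in enumerate(image_tags):
        for i in tags:
            C[j, i] = 1.0  # the j-th image contains the i-th tag
    return np.vstack([np.eye(n_tags), C])

# Toy example with 3 images and 3 tags.
H_tags = tag_image_incidence([{0, 2}, {1}, {0, 1}], n_tags=3)
w_tags = np.ones(3)  # all hyperedge weights initialized to 1
```

For the dataset described above, the result would have 1000 + 4457 rows and 1000 columns.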


Table 5.2: Hypergraph incidence matrix H for the NUS-WIDE-LITE dataset.


                                   Tags
Tags      $I_1$ | ... | $I_i$ | ... | $I_{1000}$
Images    $c_1$ | ... | $c_i$ | ... | $c_{1000}$
5.4 Hypergraph multi-stage optimizations


The proposed multi-stage optimization scheme consists of three stages, each of
which contributes to improving the recommendations. Ranking optimization,
hyperedge weight adaptation, and structure learning form the core of the
proposed scheme and are presented in detail next.


5.4.1 Hypergraph ranking optimization 


Hypergraph ranking optimization constitutes the first stage. Hypergraph ranking
is an objective figure of merit, which measures the quality of a POI. Let $w$ be the
vector containing the hyperedge weights, i.e., $W = \mathrm{diag}(w)$. Let also

    $A = D_v^{-1/2} H W D_e^{-1} H^\top D_v^{-1/2} \in \mathbb{R}^{\kappa \times \kappa}$    (5.4)

be a symmetric matrix. Then, $L = I - A \in \mathbb{R}^{\kappa \times \kappa}$ is the so-called Zhou's normalized
Laplacian matrix of the hypergraph [145]. $L$ is a positive semi-definite matrix. In
order to cluster the vertices of the hypergraph, one has to seek a ranking vector
$f \in \mathbb{R}^{\kappa}$, which minimizes

    $\Omega(f) = f^\top L f.$    (5.5)

For recommendation, a query vector $y \in \mathbb{R}^{\kappa}$ needs to be defined as [145]:

    $y(v) = \begin{cases} 1, & \text{if } v = d_j \\ A(d_j, v), & \text{otherwise} \end{cases}$    (5.6)

with $d_j$ denoting the geo-cluster the test image belongs to. The value of the $(d_j, v)$
element of $A$ is treated as a measure of relatedness between different vertices of
the hypergraph. Strongly connected vertices are supposed to have equal values
in $f$ [146]. Let $f^*$ denote the optimized ranking vector. Exploiting the $\ell_2$ norm,
recommendation can be cast as the following constrained optimization problem
[147]:

    $f^* = \arg\min_{f} T(f)$    (5.7)

where

    $T(f) = \Omega(f) + \vartheta \|f - y\|_2^2$    (5.8)

with $\vartheta > 0$ being a regularizing parameter. Let

    $J = I - \frac{1}{1+\vartheta} A.$    (5.9)

The optimization problem (5.7) assumes that $H$ and $w$ are fixed. Let $w^*$ and $H^*$
denote the optimized $w$ and $H$, respectively. Here, we are interested in solving the
three-stage optimization problem

    $(f^*, w^*, H^*) = \arg\min_{f, w, H} T(f, w, H)$    (5.10)

in an alternating optimization fashion, that is, by optimizing w.r.t. one variable
while keeping the remaining two fixed.
The best ranking vector that minimizes Eq. (5.10) w.r.t. $f$, keeping $w$ and $H$
fixed, is [147]:

    $f^* = \arg\min_{f} T(f; w, H) = \omega J^{-1} y$    (5.11)

where $\omega = \frac{\vartheta}{1+\vartheta} \in \mathbb{R}$.
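
The closed-form ranking step can be illustrated numerically. The sketch below assumes the regularized objective $f^\top (I - A) f + \vartheta \|f - y\|_2^2$ and solves the corresponding linear system instead of forming an explicit inverse; the function names `normalized_adjacency` and `ranking_vector`, the toy data, and the value of the regularizing parameter are illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(H, w):
    """A = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, as in (5.4)."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    d_v = H @ w
    d_e = H.sum(axis=0)
    Hs = H / np.sqrt(d_v)[:, None]   # D_v^{-1/2} H
    return (Hs * (w / d_e)) @ Hs.T   # scales column e by w(e) / delta(e)

def ranking_vector(A, y, theta):
    """f* = theta / (1 + theta) * (I - A / (1 + theta))^{-1} y, as in (5.11)."""
    kappa = A.shape[0]
    J = np.eye(kappa) - A / (1.0 + theta)
    return theta / (1.0 + theta) * np.linalg.solve(J, y)

# Toy hypergraph: 4 vertices, 3 hyperedges (illustrative values).
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
w = np.array([0.5, 1.0, 2.0])
theta = 0.5                       # example value only
A = normalized_adjacency(H, w)
y = A[0].copy()                   # query for vertex 0, following (5.6)
y[0] = 1.0
f_star = ranking_vector(A, y, theta)
```

Solving the linear system is usually preferable to inverting the matrix explicitly, especially for matrices of the size used in this Chapter.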


5.4.2 Hyperedge weight optimization 


The second stage of the proposed scheme consists of hyperedge weight optimization,
which aims at optimizing and modulating the impact of the relations between
the different entities captured by the hypergraph in an efficient, robust, and
automated way. For simplicity, let $Z = D_v^{-1/2} H \in \mathbb{R}^{\kappa \times \lambda}$ and
$p = Z^\top f \in \mathbb{R}^{\lambda \times 1}$. Having
optimized w.r.t. $f$, we then fix $f$ and $H$, and optimize w.r.t. $w$, adding an $\ell_2$ norm
regularizer for $w$ and enforcing $1_\lambda^\top w = 1$. Then, $w^*$ is given by [109]:

    $w^* = \arg\min_{w} \{ f^\top L f + \mu \|w\|_2^2 \}$,  s.t. $1_\lambda^\top w = 1$
         $= \arg\min_{w} \{ -p^\top \mathrm{diag}(w) D_e^{-1} p + \mu \|w\|_2^2 \}$,  s.t. $1_\lambda^\top w = 1$    (5.12)
where $\mu > 0$ is a weighting parameter. The $\ell_2$ norm regularizer in (5.12) enforces
smoothness.
In the following, the optimization problem (5.12) is solved from first principles.
It is shown that the solution proposed in [109] is sub-optimal (i.e., a special case of
the solution derived here). The Lagrangian of the optimization problem (5.12) is
given by:

    $\mathcal{L}(w) = -p^\top \mathrm{diag}(w) D_e^{-1} p + \mu \|w\|_2^2 + \eta\, (1_\lambda^\top w - 1)$    (5.13)


where $\eta$ is the Lagrange multiplier w.r.t. the equality constraint in (5.12). If one
ignores the influence of $D_v^{-1/2}$ in the derivation of the gradient of $\mathcal{L}(w)$ w.r.t. $w$,




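as done by Gao et al. in [109], then the gradient of $\mathcal{L}(w)$ w.r.t. $w$ reads

    $\nabla_w \mathcal{L}(w) = -\left( \frac{p_1^2}{D_e(1,1)}, \frac{p_2^2}{D_e(2,2)}, \ldots, \frac{p_\lambda^2}{D_e(\lambda,\lambda)} \right)^\top + 2\mu w + \eta 1_\lambda = -\xi + 2\mu w + \eta 1_\lambda$    (5.14)

where

    $\xi = \left( \frac{p_1^2}{D_e(1,1)}, \frac{p_2^2}{D_e(2,2)}, \ldots, \frac{p_\lambda^2}{D_e(\lambda,\lambda)} \right)^\top = \left( \frac{f^\top z_1 z_1^\top f}{D_e(1,1)}, \frac{f^\top z_2 z_2^\top f}{D_e(2,2)}, \ldots, \frac{f^\top z_\lambda z_\lambda^\top f}{D_e(\lambda,\lambda)} \right)^\top$    (5.15)

with $z_j$, $j = 1, 2, \ldots, \lambda$, denoting the $j$th column of $Z$. Setting $\nabla_w \mathcal{L}(w) = 0$ and
solving for $w$, we obtain

    $w^* = \frac{1}{2\mu} \{ \xi - \eta 1_\lambda \}.$    (5.16)

The value of the Lagrange multiplier results by solving $1_\lambda^\top w^* = 1$ w.r.t. $\eta$, i.e.,

    $\eta = \frac{1}{\lambda} \{ 1_\lambda^\top \xi - 2\mu \}.$    (5.17)

It is trivial to show that

    $1_\lambda^\top \xi = \sum_{j=1}^{\lambda} \frac{f^\top z_j z_j^\top f}{D_e(j,j)} = f^\top Z D_e^{-1} Z^\top f.$    (5.18)

The substitution of (5.18) into (5.17) yields

    $\eta = \frac{1}{\lambda} \{ f^\top Z D_e^{-1} Z^\top f - 2\mu \}.$    (5.19)

For the just derived value of the Lagrange multiplier, the $j$th element of $w^*$,
$j = 1, 2, \ldots, \lambda$, reads

    $w_j^* = \frac{1}{\lambda} + \frac{1}{2\mu} \frac{f^\top z_j z_j^\top f}{D_e(j,j)} - \frac{1}{2\mu\lambda} f^\top Z D_e^{-1} Z^\top f$    (5.20)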

where ps = 107°. 
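
The approximate closed-form update (5.16), (5.17), i.e., the one that ignores the dependence of the vertex degrees on the weights, can be sketched as follows. The function name `closed_form_weights`, the toy data, and the value of the weighting parameter are assumptions chosen for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def closed_form_weights(H, w, f, mu):
    """Approximate weight update of (5.16)-(5.20) for fixed f and H.

    The vertex degrees are computed from the current weights w and are then
    treated as constants, which is exactly the approximation discussed above.
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    f = np.asarray(f, dtype=float)
    d_v = H @ w
    d_e = H.sum(axis=0)
    Z = H / np.sqrt(d_v)[:, None]        # Z = D_v^{-1/2} H
    p = Z.T @ f                          # p = Z^T f
    xi = p ** 2 / d_e                    # (5.15)
    lam = H.shape[1]
    eta = (xi.sum() - 2.0 * mu) / lam    # (5.17)
    w_new = (xi - eta) / (2.0 * mu)      # (5.16)
    return w_new, eta, xi

# Toy example (illustrative values only).
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
w0 = np.array([0.5, 1.0, 2.0])
f = np.array([0.9, 0.4, 0.3, 0.1])
mu = 1.0
w_new, eta, xi = closed_form_weights(H, w0, f, mu)
```

Note that the equality constraint in (5.12), as written, only fixes the sum of the weights; depending on the weighting parameter, individual entries of the result can become negative, so a practical implementation may need an additional safeguard.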
However, $D_v = \mathrm{diag}(Hw)$. Accordingly, the solution for $w$ and $\eta$ given by
(5.20), as in [109], is an approximation only. In Appendix A, we elaborate the
gradient of $p^\top \mathrm{diag}(w) D_e^{-1} p$ w.r.t. $w$ to take into consideration the dependence
of $D_v^{-1/2}$ on $w$. It is proven that the expressions for $\eta$ and $w$ are given by:


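    $w = P \{ (\mathrm{diag}(\pi))^2\, 1_\lambda - \eta 1_\lambda \}$    (5.21)
    $\eta = (1_\lambda^\top P 1_\lambda)^{-1} \{ 1_\lambda^\top P (\mathrm{diag}(\pi))^2\, 1_\lambda - 1 \}$    (5.22)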
where 
Pp = (HD, 97H) diag(m) + 2p Thal 7 (5.23) 
    $\pi = D_e^{-1/2} H^\top D_v^{-1/2} f.$    (5.24)


It is shown that if $P = \frac{1}{2\mu} I_\lambda$, then (5.21) and (5.22) are simplified to (5.16) and


(5.17). The solution 70) is associated to the query vector y defined in (5.6). 

Moreover, we propose an iterative approach based on the LMS to provide
further insight into hyperedge weight optimization. The iterative procedure aims
to find the minimum mean square error by successive corrections of the $w$ vector
in the direction of the negative gradient of the objective function. The LMS
approach results in the following update rule [141]:

    $w^{[t]} = w^{[t-1]} - \alpha_{LMS} \left. \frac{\partial \mathcal{L}(w)}{\partial w} \right|_{w = w^{[t-1]}}$    (5.25)

where $\alpha_{LMS}$ denotes the learning rate.
Let us define 
xt — WT (pI) *"H diag (xt) eylt-a (5.26) 
-1/2 
nt = pr? AT (pi) f (5.27) 
and 
Hw 
Dit — Howl" 4 (5.28) 
H,wi-} 


with $H_l$ denoting the $l$th row of $H$.
The hyperedge weight vector w at each iteration t is updated as 
wll — wit _ cans} xt + 2uwl-t — (2u = ayxt-t))1,}


= wht cans x + 2u(w'- — 1) — 2u(afxtya,} 


= whl cns4 ( — 21 axa)x + Qu(wlN — 1). (5.29) 


5.4.3 Hypergraph structure learning 


If we repeat the same task for all query vectors associated with the $n_g = |D|$
geo-clusters and stack the solutions column-wise, the $\lambda \times n_g$ matrix $W_g$ is obtained.
Before proceeding to updating $H$, we employ a representative vector $w_r$ for all
geo-clusters. The representative vector $w_r$ is given by

    $w_r = \frac{1}{n_g} W_g 1_{n_g}$    (5.30)

where $1_{n_g}$ is a column-vector of ones. Afterwards, the representative weight vector
$w_r$ is converted to a diagonal matrix $W$, i.e., $W = \mathrm{diag}(w_r)$. Having calculated the
ranking vectors $f^*$ for all geo-clusters, we integrate them column-wise into matrix
$F$ in order to save computational time in the optimization of $H$.

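The third stage of the proposed scheme consists of hypergraph structure optimization.
Having optimized $W$ for fixed incidence matrix $H$ and $F$, hypergraph
representation is further enhanced by solving the optimization problem w.r.t. $H$,
i.e.,

    $H^* = \arg\min_{H} \sum_{q} f_q^\top \left( I - D_v^{-1/2} H W D_e^{-1} H^\top D_v^{-1/2} \right) f_q$
         $= \arg\min_{H} \sum_{q} \mathrm{tr}\left\{ \left( I - D_v^{-1/2} H W D_e^{-1} H^\top D_v^{-1/2} \right) f_q f_q^\top \right\}$
         $= \arg\min_{H} \underbrace{ -\mathrm{tr}\left\{ D_v^{-1/2} H W D_e^{-1} H^\top D_v^{-1/2} K \right\} }_{\Omega(F, H)}$    (5.31)

where $K = F F^\top$. To optimize $\Omega(F, H)$ w.r.t. $H$, one needs
$\nabla\Omega = \frac{\partial}{\partial H} \Omega(F, H)$.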
In Appendix B, it is formally derived that [141]

    $\nabla\Omega = J \left[ I \circ \left( H^\top D_v^{-1/2} K D_v^{-1/2} H \right) \right] W D_e^{-2}$
        $+ \left[ D_v^{-3/2} \circ \left( H W D_e^{-1} H^\top D_v^{-1/2} K \right) \right] J W$
        $- 2 D_v^{-1/2} K D_v^{-1/2} H W D_e^{-1}$    (5.32)

where $J$ is a $\kappa \times \lambda$ matrix of ones, $I$ is the $\lambda \times \lambda$ identity matrix, and $\circ$
denotes the element-wise (Hadamard) product. The solution of (5.31), detailed in
Appendix B, is novel. In particular, the optimization of $\Omega(F, H)$ w.r.t. $H$ resorts to
a closed-form expression for $\nabla\Omega$ in (5.32) that differs from, and corrects, Eq. (9)
in [115]. The updating equation derived herein is referred to as Corrected Structure
Learning (CSL). $H$ is updated recursively. That is, the update of $H$ at iteration
$(q + 1)$ is obtained by


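    $H^{[q+1]} = P\left[ H^{[q]} - \alpha \nabla\Omega(H^{[q]}) \right]$    (5.33)


where $\alpha$ is a step size and $P$ is the projection onto the feasible set $\{H \mid 0 \le H \le 1\}$.
$P$ is defined as [115]:


    $P[H_{ij}] = \begin{cases} H_{ij}, & \text{if } 0 \le H_{ij} \le 1 \\ 1, & \text{if } H_{ij} > 1 \\ 0, & \text{if } H_{ij} < 0. \end{cases}$    (5.34)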

5.4.4 Algorithm outlook 


The proposed method implements multi-stage optimization w.r.t. $f$, $w$, and $H$. It
is summarized in Algorithm 2.


5.5 Experimental evaluation 


5.5.1 Image recommendation using the Greek POIs dataset 


To assess the effectiveness of the proposed HMSO method for image recommenda- 
tion, recall-precision curves were calculated for multiple hypergraph ranking meth- 
ods. Each of the methods included in the study was implemented from scratch, 
ensuring a fair comparison. Ground truth was created manually, indicating the
relationships between POIs (i.e., geo-clusters). These relationships were based
on the distance between POIs, common geographical entities (e.g., whether they
were located on the mainland or on an island), and leisure activities associated with
each POI. Specifically, clusters were constructed by applying 10-NN to POIs based on their
Algorithm 2: Proposed multi-stage optimization scheme (HMSO)
Inputs: Weight vector $w$, incidence matrix $H$, and parameters $\vartheta$, $\alpha$, $\mu$.
Output: Ranking matrix $F$.


1: Compute the diagonal matrices of vertex and hyperedge degrees, $D_v$ and $D_e$,
   respectively, as well as matrix $A$.
2: Update the weights for each geo-cluster as follows:
   for $i = 1, 2, \ldots, n_g$ do
      Compute the ranking vector $f_i^* \in \mathbb{R}^{\kappa \times 1}$ as in (5.11).
      Update $w_j$, $j = 1, 2, \ldots, \lambda$ as in (5.20).
      (Alternatively use and form $w^* \in \mathbb{R}^{\lambda \times 1}$).
      Recompute $D_v$ and $A$.
      Recompute $f_i^*$.
   end for
3: Compute $w_r$ as in (5.30).
4: Form $F$.
5: Repeat: update $H$ with fixed $F$ as in (5.33) using the gradient $\nabla\Omega$ given by (5.32),
   until convergence of $H(F, w^*)$.
6: Recompute $F$.


geo-location. Afterwards, the generated clusters were manually inspected and
corrected to ensure geographical and contextual consistency. The derived clusters
constitute the ground truth. Ground truth is publicly released. To further support
the efficacy of the proposed method, it is also tested on the benchmark dataset
NUS-WIDE-LITE in Sec. In a real-world scenario, the user provides the
system with a geo-tagged image, which can be captured on the fly using a cellular
phone. That image is then associated with one of the POIs that have been
created by the system using its geo-location and its visual information. Thus, the
query vector $y$ is initialized by setting to 1 the entry that corresponds to the POI
associated with the input image, as explained in Sec. During the test phase,
the optimal weight vector $w^*$ and incidence matrix $H^*$ derived during training
are employed to compute $f^*$ as in (5.11). The computation is performed for the
geo-cluster to which the test image is assigned. The search results derived by the
proposed system are considered correct if they belong to the same ground-truth
cluster as the query image given by the user.
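
To make the evaluation criterion concrete, precision at a given recall level can be computed from a ranked result list as sketched below. This is one common convention and only an assumption about the protocol, which is not spelled out in this section: the ranked list is traversed until the requested fraction of relevant items has been retrieved, and precision is measured at that rank. The function name `precision_at_recall` is hypothetical.

```python
def precision_at_recall(relevant_flags, recall_level):
    """Precision at the first rank where recall reaches recall_level.

    relevant_flags : ranked list of 0/1 flags; 1 means the result belongs to the
                     same ground-truth cluster as the query
    recall_level   : target recall in (0, 1]
    """
    total = sum(relevant_flags)
    if total == 0:
        raise ValueError("the ranked list contains no relevant items")
    hits = 0
    for rank, flag in enumerate(relevant_flags, start=1):
        hits += 1 if flag else 0
        # Small tolerance guards against floating-point round-off.
        if hits / total >= recall_level - 1e-12:
            return hits / rank
    return hits / len(relevant_flags)

# Toy ranked list with three relevant results among five.
flags = [1, 0, 1, 1, 0]
p_low = precision_at_recall(flags, 1 / 3)
p_full = precision_at_recall(flags, 1.0)
```

Averaging such values over all queries yields one point of a recall-precision curve per recall level.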

The performance of HMSO depends on various parameters. An extensive study
was performed in order to determine which parameters affect it. The most critical
parameter was found to be the step size $\alpha$ in the stage of hypergraph structure
learning (5.33). HMSO was initially compared against the DHSL and the
CSL, which resorts to (5.32). For tourism-related image recommendation to be
practically applicable, one has to measure the recommendation precision at 1%
and 5% recall for different values of $\alpha$. At 1% recall, DHSL precision was equal
to 93.9%, while HMSO reached a precision of 92.4% for $\alpha = 0.1$. CSL achieved a
precision of 90.7%. When $\alpha$ assumed the value of 1, DHSL precision fell to 93.2%,
while the precision of the CSL deteriorated slightly to 90.6%. HMSO achieved
a precision of 92.3%, i.e., 0.1 percentage points lower than that measured for
$\alpha = 0.1$. For $\alpha = 10$, DHSL precision was equal to 31.3% and was inferior to that
of the CSL (i.e., 92.5%). The proposed HMSO achieved a precision of 93.4%. For
$\alpha = 100$, DHSL precision dropped significantly and reached 17.7%. CSL achieved a
precision of 96.4%, while HMSO reached 97.8%. The precision at 1% recall achieved by
DHSL, CSL, and HMSO for hypergraph ranking is summarized in Table 5.3 for
different values of $\alpha$. The results show that the proposed joint optimization framework
has a stable performance and, for a suitable value of $\alpha$, it is the top performer.


Table 5.3: Precision in image recommendation at 1% recall for various values of
α.

Parameter α         0.1      1        10       100
DHSL [115]          93.9%    93.2%    31.3%    17.7%
CSL (Proposed)      90.7%    90.6%    92.5%    96.4%
HMSO (Proposed)     92.4%    92.3%    93.4%    97.8%
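
The text reports precision at fixed recall levels without spelling out how the value is read off a ranked list. A minimal sketch follows, assuming (this convention is not stated in the source) that precision is measured on the shortest ranked prefix whose recall reaches the requested level; the function name and the relevance flags are hypothetical.

```python
import math


def precision_at_recall(relevance, recall_level):
    """Precision of the shortest ranked prefix whose recall reaches recall_level.

    relevance: 0/1 flags in ranked order (1 = item lies in the same
    ground-truth cluster as the query). recall_level: target recall in (0, 1].
    """
    if not 0.0 < recall_level <= 1.0:
        raise ValueError("recall_level must lie in (0, 1]")
    total_relevant = sum(relevance)
    if total_relevant == 0:
        raise ValueError("the ranked list contains no relevant items")
    # Number of relevant hits needed to reach the requested recall
    # (small tolerance guards against floating-point round-up).
    needed = max(1, math.ceil(recall_level * total_relevant - 1e-12))
    hits = 0
    for rank, flag in enumerate(relevance, start=1):
        hits += flag
        if hits >= needed:
            return hits / rank


# Hypothetical ranked list with four relevant items.
ranked_flags = [1, 0, 1, 1, 0, 1]
p_low = precision_at_recall(ranked_flags, 0.25)
p_half = precision_at_recall(ranked_flags, 0.5)
p_full = precision_at_recall(ranked_flags, 1.0)
```

Under this convention, low recall levels such as 1% probe only the very top of the ranking, which is the regime of interest for a recommendation system.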


Let us refer to the method proposed in [116] as ITH. Let us also denote as
HG-WE [109] and ITH-HWE [110] two methods for weight estimation in hypergraph
learning, which optimize weight vectors. The fourth method employed in
the experimental evaluation is DHSL [115]. Extensive tests were conducted at
various recall rates in order to determine whether the results were appropriate
for touristic applications. For 1% recall, ITH reached a precision of 90.6%.
HG-WE lagged behind ITH, delivering a precision of 88%. ITH-HWE achieved a
precision of 81.2%, smaller than that of HG-WE, which also performed hyperedge
weight update. DHSL reached a precision of 93.9%, outperforming the
aforementioned methods because it updates its structure during optimization,
enabling more accurate modeling of high-order relations. CSL, which employs a
slightly different version of DHSL, achieved a precision of 96.4%. HMSO yielded
the top precision of 97.8%, outperforming its competitors, because it exploits both
hyperedge weight adaptation and hypergraph structure optimization, which corrects
errors introduced during hypergraph construction. The joint hyperedge weight
optimization and structure learning has led to an efficient exploitation of the
relations among the three diverse sources of information, i.e., visual, textual, and
geolocation, offering high recommendation precision. This has enabled accurate
recommendations in real-life scenarios, such as those presented herein.

Figure 5.5: Recall-precision curves of the proposed method for various α in image
recommendation.

An example of tourism-related image recommendation at 1% recall employing
the proposed method is depicted in Fig. 5.6. The user provided a geo-tagged
image of the White Tower to the system and the top recommended POIs of
Thessaloniki were returned. The first three recommendations were the Hellenic
Telecommunications Tower (OTE Tower), the statue of Alexander the Great, and
the Arch of Galerius, depicted in the top panel of the figure. All three POIs are
located near the White Tower. The next three recommended POIs, shown in the
bottom panel of the figure, were the Rotunda, which is visually similar to the
White Tower, the waterfront of Thessaloniki, which ends at the White Tower, and
the Metochi of Agia Anastasia, a fortified monastery on the outskirts of
Thessaloniki. In order to visually compare the proposed approach to the ITH one,
an example of tourism-related image recommendation at 1% recall employing the
ITH method is depicted in Fig. 5.7. The geo-tagged image provided by the user
was the same as that shown in Fig. 5.6, depicting the White Tower of
Thessaloniki. On the right side, the recommendations provided by the ITH method
are depicted. The top recommended POI is shown in the upper left corner, while
the rest of the POIs are sorted in descending order, as derived by the ranking
vector, down to the lower right corner. Comparing the results derived by the ITH
method to those derived by the proposed scheme, there are a few differences to
mention. The recommended POIs derived by the ITH method were less relevant to
the input image provided by the user. For example, the second POI recommended
by ITH depicted a boat, which operates as a cafeteria on the waterfront of
Thessaloniki and is neither a landmark of Thessaloniki nor a monument visually
similar to the White Tower. Furthermore, the fifth and sixth recommended POIs
depicted a view of Thessaloniki and its waterfront, respectively. On the other
hand, all recommendations offered by the proposed scheme depicted landmarks
with historical value, which visually resemble the White Tower.

Figure 5.6: An input-output example of the proposed method at 1% recall. Left:
Image of White Tower in Thessaloniki, Greece, provided by user. Right: Various
touristic POI recommendations in Thessaloniki, as suggested by the system.

The precision of DHSL at 5% recall was equal to 94.2% for α = 0.1. The
precision of CSL was found equal to 89.9%, while the precision of HMSO was
92.7%. When α = 1, the precision of HMSO slightly increased to 92.8%, while
the precision of DHSL was equal to 93.7%. The precision of CSL slightly decreased
to 89.8%. For larger values of α (i.e., α = 10, 100), the precision of both HMSO
and CSL increased, while that of DHSL dropped significantly. Specifically,
for α = 100, HMSO reached a precision of 96.1%. The precision values at 5% recall
for various values of α in hypergraph ranking are summarized in Table 5.4.
Recall-precision curves of the proposed image recommendation method for various
values of α are plotted in Fig. 5.5.

At 5% recall, ITH reached a precision of 89.8%, which was slightly smaller than
that at 1% recall. The HG-WE method achieved a precision of 90.2%, exceeding that
of ITH. ITH-HWE precision was 81.1%. DHSL achieved a precision of 94.2%, which
is greater than that achieved at 1% recall. CSL reached a precision of 95.8%, which
is slightly lower than that achieved at 1% recall. CSL outperforms those methods
that employ hyperedge weight adaptation, demonstrating that proper incidence
matrix construction and structure learning critically affect accuracy. In CSL,
the hyperedge weights remain unaltered during the hypergraph ranking procedure,
retaining their initial user-defined value. This is how the proposed HMSO differs
from its competitors and the reason why it outperforms them. Hyperedge weight
adaptation in combination with structure learning enabled HMSO to deliver a
precision of 96.1%, making it the top performing method.

Figure 5.7: An input-output example of the ITH method at 1% recall. Left: Image
of White Tower in Thessaloniki, Greece, provided by user. Right: Various touristic
POI recommendations in Thessaloniki.

Figure 5.8: Recall-precision curves of different methods in image recommendation.
Two sub-regions of recall-precision curves are magnified within the main figure.

Table 5.4: Precision in image recommendation at 5% recall for various values of
α.

Parameter α         0.1      1        10       100
DHSL [115]          94.2%    93.7%    27.1%    15.21%
CSL (Proposed)      89.9%    89.8%    91.8%    95.8%
HMSO (Proposed)     92.7%    92.8%    93.6%    96.1%

The results derived at 10% recall behaved similarly to those at 5% recall and
were as follows. ITH achieved a precision of 88.3%; as recall increased, the
precision of ITH deteriorated. HG-WE precision reached 90.9%, outperforming that
of ITH. ITH-HWE precision was measured at 80.9%. The DHSL method reached a
precision of 94.2%, being one of the best performing methods. CSL achieved a
slightly higher precision (i.e., 94.6%) than DHSL. HMSO remained highly accurate,
yielding a precision of 95.5% and outperforming its competitors. Although the
precision of HMSO decreased as recall increased, it remained the highest.

The precision values of the hypergraph ranking methods at various recall rates
are summarized in Table 5.5. Recall-precision curves for the various methods are
overlaid in Fig. 5.8. They indicate that the proposed HMSO outperforms its
competitors at all recall rates of interest in practical applications (i.e., less
than 0.4). The proposed HMSO approach was implemented in MATLAB® and its
execution required 1314.144 sec. A 64-bit operating system with an Intel(R)
Core(TM) i9-7900X CPU at 3.3 GHz was used in the experiments conducted.


Table 5.5: Top image recommendation precision for various hypergraph ranking
methods applied to Greek POIs dataset.

Recall              1%       5%       10%
ITH                 90.6%    89.8%    88.3%
HG-WE [109]         88%      90.2%    90.9%
ITH-HWE [110]       81.2%    81.1%    80.9%
DHSL                93.9%    94.2%    94.2%
CSL (Proposed)      96.4%    95.8%    94.6%
HMSO (Proposed)     97.8%    96.1%    95.5%
5.5.2 Impact of hyperedges (Ablation study)


The developed image and tag recommendation system presented in Sec. 5.3.1
benefits from three different sources of information. The spatial identity of each
POI, accompanied by its visual representation and semantic description,
contributes to an efficient derivation of the ranking vector. The hypergraph
enabled modeling these different sources of information, deriving accurate results
with the proposed HMSO, as presented in Sec. The research question that arises is
whether these three sources of information contribute equally to precision. Four
different recall-precision curves, one per combination of sources, have been
derived in order to measure the importance of each source of information, as
depicted in Fig. 5.9. The top performance was achieved when the complete incidence
matrix was used, employing all hyperedges. For the top recommended POI, a
precision of 94.23% was measured. That combination demonstrates a stable
performance, as can be seen in the corresponding curve. As explained in Sec. 5.3.4,
depending on the associations (i.e., geolocation, visual, and semantic similarity)
among POIs, new hyperedges were created to indicate those relations. When the
hyperedges associated with semantic information are not taken into consideration
during the construction of the incidence matrix, the precision for the top
recommended POI drops to 84.62%. Although that recall-precision curve performed
better at higher recall rates, it becomes obvious that the semantic contribution
plays a significant role in recommendation. When the hyperedges associated with
visual similarities were not involved in hypergraph construction and only
geolocation and semantic information were considered, the proposed HMSO delivered
a precision of 88.46% for the top recommended POI. Although it surpassed the
combination of geolocation and visual information, it did not demonstrate a stable
performance, indicating that the hyperedges associated with visual similarities
contribute crucially to the overall performance. Finally, the combination of
visual and textual information results in a precision of 53.85% for the top
recommended POI. As expected for a POI recommendation system, the geolocation
information contributes the most to the derivation of accurate results.
Figure 5.9: Recall-precision curves for various combinations of different sources of
information.


5.5.3 Image recommendation using the LMS approach 


In order to evaluate the proposed LMS approach for hyperedge weight adaptation
presented in Sec. a subset of the presented image recommendation dataset
was used. Specifically, 25 POIs of the city of Thessaloniki accompanied by their
corresponding textual information were employed. The procedure followed to
construct the incidence matrix was the same as the one presented in Sec.
The only difference from the recommendation system presented herein was
that the proximity limit was changed. Since this example refers to the city
of Thessaloniki, the limit of 150 km for two POIs to be considered close to each
other was reduced to 500 meters.
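
The effect of the proximity limit can be illustrated with a short sketch. It assumes that closeness is decided by great-circle distance between geo-tags, which the text does not state explicitly; the haversine formula, the helper names, and the three coordinate pairs are illustrative assumptions, not values from the dataset.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def close_pairs(pois, limit_km):
    """Return the set of POI pairs whose distance does not exceed limit_km."""
    return {
        frozenset((name_a, name_b))
        for (name_a, pos_a), (name_b, pos_b) in combinations(pois.items(), 2)
        if haversine_km(*pos_a, *pos_b) <= limit_km
    }


# Hypothetical geo-tags inside one city (not taken from the dataset).
pois = {"A": (40.6264, 22.9484), "B": (40.6290, 22.9500), "C": (40.6400, 22.9440)}
pairs_country_scale = close_pairs(pois, 150.0)  # limit used for the full dataset
pairs_city_scale = close_pairs(pois, 0.5)       # limit used for the Thessaloniki subset
```

With the 150 km limit every pair inside a single city counts as close, so the geolocation hyperedges carry no discriminative information; the 500 m limit restores it.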

Recall-precision curves were employed to assess the contribution of the
proposed LMS approach within the proposed multi-stage optimization scheme. Let
us refer to the proposed HMSO that employs the proposed LMS updating of
hyperedge weights as HMSO (LMS). Regarding the top recommended POI, HMSO
(LMS) achieved a precision of 94.23%, outperforming HMSO, which yielded a
precision of 88.46%. HMSO (LMS) demonstrated an overall stable and efficient
performance, as depicted in Fig. 5.10. Although HMSO (LMS) performed slightly
better than the proposed HMSO, it still suffers from high computational
complexity. This is because it is an iterative method involving calculations among
large matrices, which are necessary in real-world applications, such as those presented
herein. In that work, the learning rate was set at azjg = 10~° and the converge 
threshold was chosen to be tT = 107°. 
Figure 5.10: Recall-precision curves for HMSO (LMS) and HMSO.


5.5.4 NUS-WIDE-LITE dataset 


To assess the proposed method in image tagging, experiments were conducted on
the publicly available NUS-WIDE-LITE dataset [118]. The proposed HMSO
approach and the CSL one were tested against the state-of-the-art approaches. For
the image tagging task, the hypergraph was constructed by employing 90% of the
total tags, while the remaining 10% were used for testing, as explained in
Secs. 5.3.3 and 5.3.4. Specifically, 10% of the tags were to be predicted by the
proposed system. The query vector is $\mathbf{y} \in \mathbb{R}^{\kappa \times 1}$ and
the affinity matrix is $\mathbf{A} \in \mathbb{R}^{\kappa \times \kappa}$ with
$\kappa = |Tags| + |Images|$. Let $\mathbf{y}$ be partitioned as
$\mathbf{y} = [\mathbf{y}_T^{\top} \,|\, \mathbf{y}_I^{\top}]^{\top}$ and the affinity
matrix $\mathbf{A}$ be split as

\[
\mathbf{A} = \begin{bmatrix} \mathbf{A}_{TT} & \mathbf{A}_{TI} \\ \mathbf{A}_{IT} & \mathbf{A}_{II} \end{bmatrix}.
\]

The bottom subvector $\mathbf{y}_I$ is of size $|Images| \times 1$. If the query vector
refers to the $i$th image, then $y_I(i) = 1$ and all other entries are set as
$y_I(i') = A_{II}(i, i')$, $i' \neq i$. The top subvector $\mathbf{y}_T$ is of size
$|Tags| \times 1$. Its entries are set as $y_T(j) = A_{IT}(i, j)$,
$j = 1, 2, \ldots, |Tags|$, where $\mathbf{A}_{IT}$ is determined during training.
$\mathbf{f}^*$ has a similar structure to $\mathbf{y}$. The tags associated with the
largest values of $\mathbf{f}^*$ derived by (5.11) are recommended. The
state-of-the-art DHSL approach reached a precision of 37.4%
for the top recommended tag, while the corrected version of structure learning
derived herein (CSL) reached a precision of 40.4%. CSL outperformed DHSL, because
high-order data relations were captured more efficiently through the correct
optimization of the incidence matrix. Resorting to weight optimization only,
HG-WE precision was measured at 30.27%, while ITH delivered a precision of
22.42%. The proposed HMSO yielded a precision of 41%, demonstrating its
potential over the other methods. HMSO adopted both structure learning and weight
adaptation, enabling accurate hypergraph ranking. Tag recommendation requires
high precision at low recall rates as well. For the top two recommended tags,
DHSL achieved a precision of 33.4% and was outperformed by CSL, which offered a
precision of 37.6%. HG-WE reached a precision of 27.8%, while ITH delivered
20.3%. The proposed HMSO approach reached 38.2%, attesting to its superiority
over the rest of the methods, which stems from the combination of
CSL and hyperedge weight update. For the top three recommended tags, DHSL
delivered a precision of 31.3%, lagging behind CSL, which yielded a precision of
34.7%. The proposed HMSO yielded a precision of 35.3%, reaffirming its
effectiveness over the other methods. HG-WE and ITH reached a precision of 25.9%
and 18.4%, respectively. Tag recommendation recall-precision curves of the proposed
HMSO and CSL against state-of-the-art methods are plotted in Fig. 5.11. On
NUS-WIDE-LITE, HMSO offers a slightly higher precision in tag recommendation
than CSL for recall rates less than 0.4. Both HMSO and CSL outperform
the other state-of-the-art methods. The execution of the proposed HMSO
approach required 742.144 sec. to conclude.
Figure 5.11: Recall-precision curves of different methods applied to NUS-WIDE-LITE
dataset for tag recommendation.
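
The construction of the query vector for image tagging in Sec. 5.5.4 can be sketched as follows. The sketch assumes that vertices are ordered with tags first and images second, and that the tag part of the query is read from the image-to-tag block of the affinity matrix; the function name and the toy matrix are hypothetical.

```python
import numpy as np


def build_query_vector(A, n_tags, image_index):
    """Assemble y = [y_T; y_I] for the image with (0-based) index image_index.

    A: (kappa x kappa) affinity matrix, tags first and images second (assumed order).
    """
    A_IT = A[n_tags:, :n_tags]      # image-to-tag block
    A_II = A[n_tags:, n_tags:]      # image-to-image block
    y_T = A_IT[image_index, :].copy()
    y_I = A_II[image_index, :].copy()
    y_I[image_index] = 1.0          # the query image itself
    return np.concatenate([y_T, y_I])


# Toy affinity matrix with 2 tags and 3 images (illustrative values only).
A_toy = np.arange(25, dtype=float).reshape(5, 5) / 100.0
y_toy = build_query_vector(A_toy, n_tags=2, image_index=1)
```

The tags to recommend would then be read from the top part of the ranking vector obtained for this query.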


Weight optimization 


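In the following, we elaborate the gradient of
$\mathbf{p}^{\top}\mathrm{diag}(\mathbf{w})\mathbf{D}_e^{-1}\mathbf{p}$ w.r.t. $\mathbf{w}$:

\[
\frac{\partial}{\partial \mathbf{w}}\,\mathrm{tr}\left\{\mathbf{Z}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}\mathbf{f}^{\top}\right\}
= \frac{\partial}{\partial \mathbf{w}}\,\mathrm{tr}\left\{\mathbf{Z}\boldsymbol{\Phi}_1\right\}
+ \frac{\partial}{\partial \mathbf{w}}\,\mathrm{tr}\left\{\mathbf{Z}\mathbf{W}\boldsymbol{\Phi}_2\right\}
+ \frac{\partial}{\partial \mathbf{w}}\,\mathrm{tr}\left\{\boldsymbol{\Phi}_3\mathbf{Z}^{\top}\mathbf{f}\mathbf{f}^{\top}\right\}
\tag{A.1}
\]

where

\[
\boldsymbol{\Phi}_1 = \mathbf{W}\mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}\mathbf{f}^{\top}, \qquad
\boldsymbol{\Phi}_2 = \mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}\mathbf{f}^{\top}, \qquad
\boldsymbol{\Phi}_3 = \mathbf{Z}\mathbf{W}\mathbf{D}_e^{-1}.
\]

In each term of (A.1), only the factor written explicitly ($\mathbf{Z}$, $\mathbf{W}$, or
$\mathbf{Z}^{\top}$) is differentiated, while the corresponding $\boldsymbol{\Phi}$ is held fixed.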

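Let $\boldsymbol{\Psi} = \mathbf{D}_v^{-1/2}$. The first term in (A.1) is rewritten as

\[
\frac{\partial}{\partial w_l}\,\mathrm{tr}\left\{\mathbf{Z}\boldsymbol{\Phi}_1\right\}
= \frac{\partial}{\partial w_l}\,\mathrm{tr}\left\{\mathbf{D}_v^{-1/2}\mathbf{H}\boldsymbol{\Phi}_1\right\}
= \frac{\partial}{\partial w_l}\,\mathrm{tr}\left\{\mathbf{H}\boldsymbol{\Phi}_1\boldsymbol{\Psi}\right\}
\tag{A.2}
\]

where $l = 1, 2, \ldots, \lambda$. Since $\boldsymbol{\Psi}$ is a diagonal matrix

\[
\frac{\partial}{\partial \boldsymbol{\Psi}}\,\mathrm{tr}\left\{\mathbf{H}\boldsymbol{\Phi}_1\boldsymbol{\Psi}\right\}
= \left(\mathbf{H}\boldsymbol{\Phi}_1 \circ \mathbf{I}_{\kappa\times\kappa}\right).
\tag{A.3}
\]

Moreover,

\[
\frac{\partial \boldsymbol{\Psi}}{\partial w_l}
= \frac{\partial}{\partial w_l}\,\mathrm{diag}^{-1/2}(\mathbf{H}\mathbf{w})
= -\frac{1}{2}
\begin{bmatrix}
\left((\mathbf{h}^1)^{\top}\mathbf{w}\right)^{-3/2} & & \mathbf{0} \\
 & \ddots & \\
\mathbf{0} & & \left((\mathbf{h}^{\kappa})^{\top}\mathbf{w}\right)^{-3/2}
\end{bmatrix}
\mathrm{diag}(\mathbf{h}_l)
\tag{A.4}
\]

where $(\mathbf{h}^i)^{\top}$, $i = 1, 2, \ldots, \kappa$ and $\mathbf{h}_l$,
$l = 1, 2, \ldots, \lambda$ denote the $i$th row and $l$th column of $\mathbf{H}$,
respectively. The substitution of (A.3) and (A.4) into (A.2) yields

\[
\frac{\partial}{\partial w_l}\,\mathrm{tr}\left\{\mathbf{Z}\boldsymbol{\Phi}_1\right\}
= -\frac{1}{2}\,\mathrm{tr}\left\{\left(\mathbf{H}\boldsymbol{\Phi}_1 \circ \mathbf{I}_{\kappa\times\kappa}\right)
\begin{bmatrix}
\left((\mathbf{h}^1)^{\top}\mathbf{w}\right)^{-3/2} & & \mathbf{0} \\
 & \ddots & \\
\mathbf{0} & & \left((\mathbf{h}^{\kappa})^{\top}\mathbf{w}\right)^{-3/2}
\end{bmatrix}
\mathrm{diag}(\mathbf{h}_l)\right\}.
\tag{A.5}
\]

The second term in (A.1) reads

\[
\frac{\partial}{\partial \mathbf{W}}\,\mathrm{tr}\left\{\mathbf{Z}\mathbf{W}\boldsymbol{\Phi}_2\right\}
= \boldsymbol{\Phi}_2\mathbf{Z} \circ \mathbf{I}_{\lambda\times\lambda}.
\tag{A.6}
\]

Accordingly,

\[
\frac{\partial}{\partial w_l}\,\mathrm{tr}\left\{\mathbf{Z}\mathbf{W}\boldsymbol{\Phi}_2\right\}
= \left(\boldsymbol{\Phi}_2\mathbf{Z}\right)_{ll}.
\tag{A.7}
\]

The third term in (A.1) is equivalent to (A.5), because

\[
\frac{\partial}{\partial w_l}\,\mathrm{tr}\left\{\mathbf{Z}^{\top}\boldsymbol{\Phi}_1^{\top}\right\}
= \frac{\partial}{\partial w_l}\,\mathrm{tr}\left\{\mathbf{Z}\boldsymbol{\Phi}_1\right\}.
\tag{A.8}
\]

To sum up,

\[
\begin{aligned}
\frac{\partial}{\partial w_l}\left\{\mathbf{f}^{\top}\mathbf{Z}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}\right\}
&= -\mathrm{tr}\left\{\left(\mathbf{H}\boldsymbol{\Phi}_1 \circ \mathbf{I}_{\kappa\times\kappa}\right)
\begin{bmatrix}
\left((\mathbf{h}^1)^{\top}\mathbf{w}\right)^{-3/2} & & \mathbf{0} \\
 & \ddots & \\
\mathbf{0} & & \left((\mathbf{h}^{\kappa})^{\top}\mathbf{w}\right)^{-3/2}
\end{bmatrix}
\mathrm{diag}(\mathbf{h}_l)\right\}
+ \left(\boldsymbol{\Phi}_2\mathbf{Z}\right)_{ll} \\
&= -\sum_{i=1}^{\kappa}\left(\mathbf{H}\boldsymbol{\Phi}_1\right)_{ii}\left((\mathbf{h}^i)^{\top}\mathbf{w}\right)^{-3/2} H_{il}
+ \left(\boldsymbol{\Phi}_2\mathbf{Z}\right)_{ll}.
\end{aligned}
\tag{A.9}
\]

It can be shown that

\[
\left(\mathbf{H}\boldsymbol{\Phi}_1\right)_{ii}
= \sum_{l'=1}^{\lambda} H_{il'}\left(\mathbf{W}\mathbf{D}_e^{-1}\right)_{l'l'}\left(\mathbf{Z}^{\top}\mathbf{f}\mathbf{f}^{\top}\right)_{l'i},
\tag{A.10}
\]

and

\[
\left(\boldsymbol{\Phi}_2\mathbf{Z}\right)_{ll}
= \sum_{i=1}^{\kappa}\left(\mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}\mathbf{f}^{\top}\right)_{li}\left(\mathbf{D}_v^{-1/2}\right)_{ii} H_{il}.
\tag{A.11}
\]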


i=1 
The substitution of (A.10) and (A.11) into (A.9) yields
a) 


Ow, 


= 9 (o Hy (WD2") 17 (Z'f fy): 


i=l U=1 


. ((n')Tw ) —3/2 Hy 4 y@ D-'Z'f fj i(D,/?) «Ha 
D. wot i; Sm ee Wor | (Do) is 
Benn falbam Tak) Nor Fe 


=1 i=1 /=1 


(Dy) Ha 


{fZwD: are} 


d 
2 
= (Do")u (f'D,"7h,) — yoy (Do*)yy (f'D; 17h, )- 


=1 


(hy D3”? hy). (A.12) 
Let 
c= (02), (Pm) =(D2"),m a 
Then 
O let -lgT 
— if 'Zwo'z't} = 
WI 
» 
— So wpy (De )yy & ef (H"D;*?H) e; (A.14) 
vat 


where $\mathbf{e}_l$ is a unit vector having elements $(\mathbf{e}_l)_q = \delta_{lq}$ with
$\delta_{lq}$ denoting the Kronecker delta. (A.14) is rewritten as


2 {f2w; zt} = 


Ow, 
r 
a ( > wy & °) (H'D,*”H) e] 
'=1 


wy 1 1yx 
=@-Jel|---lex]} : | (A™D9?H)e 
| Ey Tye 


vi 


= ¢2_yT |(w of) @ hx] (H"D;*?H) e; 


Let 
G= (D,*"H) l(w fo) €)! Q Tyea]v 
then 
a my 
jaf awDe zt =(€0£)-8° 
Ow ‘ 
aN 


= (608) ~ (H'D,*”H) [(wo8)” Sha] v. 
It can be shown that € = De”? H'D,'/’f. Moreover, 
(wo €)'@Lx, = (wl o€") @Lyxy 
=w'o (ores HD;""| Siex 
If we substitute into (A.17), we obtain 


0 
jaf weit = 
Ow 
_ (D; 4H" De 7 D;7H"D;1"¢) = 
a 
— (H"D; HH) zal — jzal] LS 
en 


- (D: D-!2H™D —1/2¢ 4 D;?H"D;"f) — 
(a D,*?H) z 
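where $\mathbf{z} = \mathbf{w} \circ \mathbf{D}_e^{-1/2}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{f}$.
Let $\boldsymbol{\pi}$ be defined as in (5.24). Consequently,

\[
\frac{\partial}{\partial \mathbf{w}}\left\{\mathbf{f}^{\top}\mathbf{Z}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}\right\}
= (\boldsymbol{\pi} \circ \boldsymbol{\pi}) - \left(\mathbf{H}^{\top}\mathbf{D}_v^{-3/2}\mathbf{H}\right)(\mathbf{w} \circ \boldsymbol{\pi}).
\tag{A.20}
\]

For the constrained optimization problem (5.12), the Lagrangian is given by

\[
\mathcal{L}(\mathbf{w}) = -\mathbf{f}^{\top}\mathbf{Z}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}
+ \mu\|\mathbf{w}\|_2^2 + \eta\left(\mathbf{1}_{\lambda}^{\top}\mathbf{w} - 1\right).
\tag{A.21}
\]

By setting $\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}) = \mathbf{0}$, we obtain

\[
-\nabla_{\mathbf{w}}\left(\mathbf{f}^{\top}\mathbf{Z}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{Z}^{\top}\mathbf{f}\right)
+ 2\mu\mathbf{w} + \eta\mathbf{1}_{\lambda} = \mathbf{0}.
\tag{A.22}
\]

Replacing (A.20) into (A.22) yields

\[
\left(\mathbf{H}^{\top}\mathbf{D}_v^{-3/2}\mathbf{H}\right)\mathrm{diag}(\boldsymbol{\pi})\,\mathbf{w}
- \mathrm{diag}(\boldsymbol{\pi})\,\boldsymbol{\pi} + 2\mu\mathbf{w} + \eta\mathbf{1}_{\lambda} = \mathbf{0}.
\tag{A.23}
\]

(A.23) is rewritten as

\[
\left[\left(\mathbf{H}^{\top}\mathbf{D}_v^{-3/2}\mathbf{H}\right)\mathrm{diag}(\boldsymbol{\pi}) + 2\mu\mathbf{I}_{\lambda\times\lambda}\right]\mathbf{w}
= (\boldsymbol{\pi} \circ \boldsymbol{\pi}) - \eta\mathbf{1}_{\lambda\times 1}
= \mathrm{diag}(\boldsymbol{\pi})\,\boldsymbol{\pi} - \eta\mathbf{1}_{\lambda\times 1}.
\tag{A.24}
\]

The solution w.r.t. $\mathbf{w}$ reads

\[
\mathbf{w} = \mathbf{P}\left\{\mathrm{diag}(\boldsymbol{\pi})\,\boldsymbol{\pi} - \eta\mathbf{1}_{\lambda}\right\},
\tag{A.25}
\]

where $\mathbf{P}$ is defined in (5.23).
The Lagrange multiplier is found by solving $\mathbf{1}_{\lambda}^{\top}\mathbf{w} = 1$, i.e.,

\[
1 = \mathbf{1}_{\lambda}^{\top}\mathbf{w}
= \mathbf{1}_{\lambda}^{\top}\mathbf{P}\,\mathrm{diag}(\boldsymbol{\pi})\,\boldsymbol{\pi}
- \eta\,\mathbf{1}_{\lambda}^{\top}\mathbf{P}\mathbf{1}_{\lambda}
\tag{A.26}
\]

yielding

\[
\eta = \left(\mathbf{1}_{\lambda}^{\top}\mathbf{P}\mathbf{1}_{\lambda}\right)^{-1}
\left[\mathbf{1}_{\lambda}^{\top}\mathbf{P}\,\mathrm{diag}(\boldsymbol{\pi})\,\boldsymbol{\pi} - 1\right].
\tag{A.27}
\]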


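Setting $\mathbf{P} = (2\mu\,\mathbf{I}_{\lambda\times\lambda})^{-1} = \frac{1}{2\mu}\mathbf{I}_{\lambda\times\lambda}$
and substituting it into (A.25) and (A.27), we
obtain the approximations of $\mathbf{w}$ and $\eta$ derived in [109]: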


i 
w*= Wi {€—n1)}. (A.28) 
and 
2 1 1 
n= > ae =5 1x6 - 2). (A.29) 
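Incidence matrix optimization


\[
\begin{aligned}
\mathbf{H}^* &= \arg\min_{\mathbf{H}} \sum_{i=1}^{n} \mathbf{f}_i^{\top}\left(\mathbf{I} - \mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\right)\mathbf{f}_i \\
&= \arg\min_{\mathbf{H}} \left(-\mathrm{tr}\left\{\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{K}\right\}\right)
\end{aligned}
\tag{B.1}
\]

where $\mathbf{K} = \mathbf{F}\mathbf{F}^{\top} \in \mathbb{R}^{\kappa\times\kappa}$ and
$\mathbf{F} = [\mathbf{f}_1 | \ldots | \mathbf{f}_n] \in \mathbb{R}^{\kappa\times n}$. Taking the partial
derivative of $\mathrm{tr}\{\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{K}\}$
w.r.t. $\mathbf{H}$ yields

\[
\begin{aligned}
\frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\right\}
&= \frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{Z}_1\right\}
+ \frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_2\mathbf{H}\mathbf{Z}_3\right\} \\
&\quad + \frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_4\mathbf{D}_e^{-1}\mathbf{Z}_5\right\}
+ \frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_6\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\right\} \\
&\quad + \frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_7\mathbf{D}_v^{-1/2}\right\}
\end{aligned}
\tag{B.2}
\]

where the following auxiliary matrices are employed, i.e.,

\[
\begin{aligned}
\mathbf{Z}_1 &= \mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}, &
\mathbf{Z}_2 &= \mathbf{K}\mathbf{D}_v^{-1/2}, \\
\mathbf{Z}_3 &= \mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}, &
\mathbf{Z}_4 &= \mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}, \\
\mathbf{Z}_5 &= \mathbf{H}^{\top}\mathbf{D}_v^{-1/2}, &
\mathbf{Z}_6 &= \mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}, \\
\mathbf{Z}_7 &= \mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}.
\end{aligned}
\]

Let us elaborate the 2nd and 4th terms in (B.2), i.e.,

\[
\begin{aligned}
\frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_2\mathbf{H}\mathbf{Z}_3\right\}
+ \frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_6\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\right\}
&= \mathbf{Z}_2^{\top}\mathbf{Z}_3^{\top} + \mathbf{D}_v^{-1/2}\mathbf{Z}_6 \\
&= \mathbf{D}_v^{-1/2}\left(\mathbf{K}^{\top} + \mathbf{K}\right)\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1} \\
&= 2\,\mathbf{D}_v^{-1/2}\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}
\end{aligned}
\tag{B.3}
\]

because $\mathbf{K}$ is a symmetric matrix and $\mathbf{W}$ and $\mathbf{D}_e^{-1}$ commute as diagonal matrices.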
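If $\boldsymbol{\Psi} = \mathbf{D}_v^{-1/2}$, the 1st term in (B.2) can be rewritten as

\[
\begin{aligned}
\frac{\partial}{\partial H_{ij}}\,\mathrm{tr}\left\{\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{Z}_1\right\}
&= \frac{\partial}{\partial H_{ij}}\,\mathrm{tr}\left\{\mathbf{Z}_1\mathbf{K}\boldsymbol{\Psi}\right\}
= \mathrm{tr}\left\{\left[\frac{\partial}{\partial \boldsymbol{\Psi}}\,\mathrm{tr}\left\{\mathbf{Z}_1\mathbf{K}\boldsymbol{\Psi}\right\}\right]^{\top}\frac{\partial \boldsymbol{\Psi}}{\partial H_{ij}}\right\} \\
&= \mathrm{tr}\left\{\left[\mathbf{Z}_1\mathbf{K} \circ \mathbf{I}_{\kappa\times\kappa}\right]\frac{\partial}{\partial H_{ij}}\,\mathrm{diag}^{-1/2}(\mathbf{H}\mathbf{W}\mathbf{1}_{\lambda})\right\}
\end{aligned}
\tag{B.4}
\]

where we have exploited that $[\mathbf{Z}_1\mathbf{K} \circ \mathbf{I}_{\kappa\times\kappa}]$ is a diagonal matrix.

It can be shown that

\[
\frac{\partial}{\partial H_{ij}}\,\mathrm{tr}\left\{\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{Z}_1\right\}
= \sum_{q=1}^{\kappa}\left(\mathbf{Z}_1\mathbf{K}\right)_{qq}\left[\frac{\partial}{\partial H_{ij}}\,\mathrm{diag}^{-1/2}(\mathbf{H}\mathbf{W}\mathbf{1}_{\lambda})\right]_{qq}
\tag{B.5}
\]

where

\[
\begin{aligned}
\frac{\partial}{\partial H_{ij}}\,\mathrm{diag}^{-1/2}(\mathbf{H}\mathbf{W}\mathbf{1}_{\lambda})
&= \frac{\partial}{\partial H_{ij}}
\begin{bmatrix}
\left(\sum_{l=1}^{\lambda} H_{1l}w_l\right)^{-1/2} & & \mathbf{0} \\
 & \ddots & \\
\mathbf{0} & & \left(\sum_{l=1}^{\lambda} H_{\kappa l}w_l\right)^{-1/2}
\end{bmatrix} \\
&= -\frac{1}{2}\,\mathrm{diag}\left(0, \ldots, 0, \left(\sum_{l=1}^{\lambda} H_{il}w_l\right)^{-3/2} w_j, 0, \ldots, 0\right)
\end{aligned}
\tag{B.6}
\]

where the only non-zero element is located in the $i$th row. To sum up,

\[
\frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{Z}_1\right\}
= -\frac{1}{2}\left(\mathbf{Z}_1\mathbf{K} \circ \mathbf{D}_v^{-3/2}\right)\mathbf{J}_{\kappa\times\lambda}\mathbf{W}.
\tag{B.7}
\]

In the same manner, the 5th term in (B.2) reads

\[
\frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_7\mathbf{D}_v^{-1/2}\right\}
= -\frac{1}{2}\left(\mathbf{Z}_7 \circ \mathbf{D}_v^{-3/2}\right)\mathbf{J}_{\kappa\times\lambda}\mathbf{W}.
\tag{B.8}
\]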
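Adding the 1st and the 5th terms yields

\[
\begin{aligned}
\frac{\partial}{\partial \mathbf{H}}\left\{\mathrm{tr}\left(\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{Z}_1\right) + \mathrm{tr}\left(\mathbf{Z}_7\mathbf{D}_v^{-1/2}\right)\right\}
&= -\frac{1}{2}\left(\mathbf{Z}_1\mathbf{K} \circ \mathbf{D}_v^{-3/2} + \mathbf{Z}_7 \circ \mathbf{D}_v^{-3/2}\right)\mathbf{J}_{\kappa\times\lambda}\mathbf{W} \\
&= -\frac{1}{2}\left[\left(\mathbf{K}\mathbf{Z}_1^{\top} + \mathbf{Z}_7\right) \circ \mathbf{D}_v^{-3/2}\right]\mathbf{J}_{\kappa\times\lambda}\mathbf{W} \\
&= -\frac{1}{2}\left\{\left[\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{D}_e^{-1}\mathbf{W}\mathbf{H}^{\top} + \mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\right] \circ \mathbf{D}_v^{-3/2}\right\}\mathbf{J}_{\kappa\times\lambda}\mathbf{W} \\
&= -\left\{\left[\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{D}_e^{-1}\mathbf{W}\mathbf{H}^{\top}\right] \circ \mathbf{D}_v^{-3/2}\right\}\mathbf{J}_{\kappa\times\lambda}\mathbf{W} \\
&= -\left[\mathbf{D}_v^{-3/2} \circ \left(\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{K}\right)\right]\mathbf{J}_{\kappa\times\lambda}\mathbf{W}.
\end{aligned}
\tag{B.9}
\]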
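The 3rd term in (B.2) is rewritten as

\[
\begin{aligned}
\frac{\partial}{\partial H_{ij}}\,\mathrm{tr}\left\{\mathbf{Z}_4\mathbf{D}_e^{-1}\mathbf{Z}_5\right\}
&= \frac{\partial}{\partial H_{ij}}\,\mathrm{tr}\left\{\mathbf{Z}_5\mathbf{Z}_4\mathbf{D}_e^{-1}\right\}
= \mathrm{tr}\left\{\left[\mathbf{Z}_5\mathbf{Z}_4 \circ \mathbf{I}_{\lambda\times\lambda}\right]\frac{\partial}{\partial H_{ij}}\,\mathrm{diag}^{-1}\left(\mathbf{1}_{\kappa}^{\top}\mathbf{H}\right)\right\} \\
&= \sum_{p=1}^{\lambda}\left(\mathbf{Z}_5\mathbf{Z}_4\right)_{pp}\left[\frac{\partial}{\partial H_{ij}}\,\mathrm{diag}^{-1}\left(\mathbf{1}_{\kappa}^{\top}\mathbf{H}\right)\right]_{pp}.
\end{aligned}
\tag{B.10}
\]

In addition,

\[
\frac{\partial}{\partial H_{ij}}\,\mathrm{diag}^{-1}\left(\mathbf{1}_{\kappa}^{\top}\mathbf{H}\right)
= -\mathrm{diag}\left(0, \ldots, 0, \left(\sum_{v=1}^{\kappa} H_{vj}\right)^{-2}, 0, \ldots, 0\right)
\tag{B.11}
\]

where the only non-zero element is located in the $j$th row. To sum up,

\[
\begin{aligned}
\frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{Z}_4\mathbf{D}_e^{-1}\mathbf{Z}_5\right\}
&= -\mathbf{J}_{\kappa\times\lambda}\left(\mathbf{Z}_5\mathbf{Z}_4 \circ \mathbf{I}_{\lambda\times\lambda}\right)\mathbf{D}_e^{-2} \\
&= -\mathbf{J}_{\kappa\times\lambda}\left(\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W} \circ \mathbf{I}_{\lambda\times\lambda}\right)\mathbf{D}_e^{-2} \\
&= -\mathbf{J}_{\kappa\times\lambda}\left[\mathbf{I}_{\lambda\times\lambda} \circ \left(\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\right)\mathbf{W}\mathbf{D}_e^{-2}\right].
\end{aligned}
\tag{B.12}
\]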
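Substituting (B.3), (B.9), and (B.12) into (B.2) yields the gradient in (5.32), i.e.,

\[
\begin{aligned}
-\frac{\partial}{\partial \mathbf{H}}\,\mathrm{tr}\left\{\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\right\}
&= \mathbf{J}_{\kappa\times\lambda}\left[\mathbf{I}_{\lambda\times\lambda} \circ \left(\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\right)\mathbf{W}\mathbf{D}_e^{-2}\right] \\
&\quad + \left[\mathbf{D}_v^{-3/2} \circ \left(\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^{\top}\mathbf{D}_v^{-1/2}\mathbf{K}\right)\right]\mathbf{J}_{\kappa\times\lambda}\mathbf{W} \\
&\quad - 2\,\mathbf{D}_v^{-1/2}\mathbf{K}\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}.
\end{aligned}
\tag{B.13}
\]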
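
A closed-form gradient such as (B.13) can be sanity-checked against central finite differences. The sketch below is such a check under stated assumptions: the incidence matrix is relaxed to real values, the vertex and hyperedge degrees are taken as H w and the column sums of H, and J is read as the all-ones matrix of size κ × λ. All function names and test data are hypothetical.

```python
import numpy as np


def objective(H, w, K):
    """-tr{K Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}} with H treated as real-valued."""
    dv = H @ w                      # vertex degrees, diagonal of D_v
    de = H.sum(axis=0)              # hyperedge degrees, diagonal of D_e
    Dv_isqrt = np.diag(dv ** -0.5)
    A = Dv_isqrt @ H @ np.diag(w / de) @ H.T @ Dv_isqrt
    return -np.trace(K @ A)


def gradient_b13(H, w, K):
    """Right-hand side of (B.13); J is assumed to be the all-ones matrix."""
    kappa, lam = H.shape
    dv, de = H @ w, H.sum(axis=0)
    W = np.diag(w)
    Dv_isqrt, Dv_m32 = np.diag(dv ** -0.5), np.diag(dv ** -1.5)
    De_inv, De_m2 = np.diag(de ** -1.0), np.diag(de ** -2.0)
    J, I = np.ones((kappa, lam)), np.eye(lam)
    # Term stemming from the dependence of D_e on H.
    term_de = J @ (I * (H.T @ Dv_isqrt @ K @ Dv_isqrt @ H @ W @ De_m2))
    # Term stemming from the dependence of D_v on H.
    term_dv = (Dv_m32 * (H @ W @ De_inv @ H.T @ Dv_isqrt @ K)) @ J @ W
    # Term stemming from the explicit occurrences of H and H^T.
    term_h = 2.0 * Dv_isqrt @ K @ Dv_isqrt @ H @ W @ De_inv
    return term_de + term_dv - term_h


def numerical_gradient(H, w, K, eps=1e-6):
    """Entry-wise central finite differences of the objective w.r.t. H."""
    grad = np.zeros_like(H)
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            step = np.zeros_like(H)
            step[i, j] = eps
            grad[i, j] = (objective(H + step, w, K) - objective(H - step, w, K)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
H = rng.uniform(0.5, 1.5, size=(6, 4))   # relaxed (real-valued) incidence matrix
w = rng.uniform(0.1, 1.0, size=4)        # positive hyperedge weights
F = rng.standard_normal((6, 3))
K = F @ F.T                              # symmetric, as required by (B.3)
max_abs_error = np.max(np.abs(gradient_b13(H, w, K) - numerical_gradient(H, w, K)))
```

Agreement between the two gradients on random data supports the signs and the placement of the Hadamard products in (B.13), although it is a numerical check rather than a proof.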
Chapter 6


Block randomized optimization 
for adaptive hypergraph learning 


Social media platforms store huge amounts of multimedia content daily and
encourage users to provide descriptions and tags about it. This perpetual, dynamic
procedure peaks with users sharing the content among the community. As a
consequence, ever-increasing volumes of data are stored every second on companies'
servers. Handling these data becomes a very important task. The era of big data has
motivated research towards integrating and developing new methods to cope with
large volumes. Large-scale matrix analysis requires huge amounts of resources
(i.e., time, memory).

Randomized algorithms for such large-scale matrices are widely used in order to 
derive approximate low-rank matrix factorizations. They exploit the structure of 
matrices to provide partial decompositions. For example, in hypergraph learning, 
one has to cope with a large hypergraph Laplacian or adjacency matrix, posing 
difficulties in direct matrix inversions or Singular Value Decomposition (SVD) 
computation. Moreover, when it comes to large-scale problems, classical methods 
may face difficulties in effectively handling inaccurate or missing values. The main
underlying idea of randomized techniques for low-rank matrix approximations is
to seek a subspace that captures most of the action of the raw matrix [149].
Afterwards, common deterministic approaches can be applied. It is shown through 
extensive experiments and detailed error analysis that randomized algorithms for 
low-rank approximations are often more robust, accurate, and faster than the 
classical methods, such as the direct SVD [149]. 

Recently, a low-rank approximation of large-scale matrices was proposed using
a sparse orthogonal transformation matrix for reducing data dimension [150]. In
[151], an efficient randomized scheme for approximate matrix decomposition with
SVD was presented. An efficient probabilistic scheme with finite probability of
failure was introduced in [152]. A fast deterministic method to solve the
high-dimensional low-rank approximation problem was proposed in [153]. There, a
randomized algorithm was adopted and more information was exploited thanks to a
sparse subspace embedding. Recently, a low-rank approximation of sparse matrices
based on Lower-Upper (LU) factorization was presented [154]. Column and row
permutations were adopted, searching for an optimal trade-off between speed and
accuracy. In [155], a fast and accurate method of a closed-form proximal operator
for solving nuclear norm minimization and its weighted alternative was proposed.
This method significantly reduced the computational cost by avoiding a direct
computation of the SVD. An approximate basis, capturing the range of the input
matrix, was formed from its compressed version.

Hypergraphs find extensive application across diverse scientific disciplines for
modeling high-order relations among heterogeneous vertices. They provide a solid
theoretical foundation that supports their use in numerous domains such as
multimedia search, data mining, mathematics, etc. [156].

In this Chapter, the adaptive gradient descent hypergraph learning scheme
presented in [112] is extended by implementing block randomized SVD in the
optimization step to reduce time requirements. In this first approach, even though
randomized methods are suitable for low-rank matrices, matrix tessellation enables
the application of randomized SVD to the full-rank matrices met in the optimization
problems addressed. Creating rank-deficient blocks in the main diagonal allows
low-rank submatrices to be inverted, reducing execution time through block
randomized SVD [157]. Moreover, a second application for image tagging is proposed
that benefits from the conjugate gradient method employed in the solution of the
associated optimization problem. Using these two approaches, we can exploit the
benefits of the adaptive hypergraph learning scheme even when a large number
of images are used as input. These two approaches are the novel contributions of
this work, enabling the application of adaptive hypergraph learning to even larger
hypergraphs than those considered here. It is demonstrated that both approaches
achieve accurate image tagging, measured by the $F_1$ measure as in [158], and
succeed in drastically reducing the computational time requirements.


6.1 Adaptive hyperedge weight updating model 


Let $|\cdot|$ denote set cardinality, $\|\cdot\|$ be the $\ell_2$-norm of a vector, and
$\mathbf{I}$ denote the identity matrix of compatible dimensions. A hypergraph
$G(V, E, w)$ captures high-order relationships in social media, where $V$ is the set
of vertices, $E$ is the set of hyperedges, and $w(\cdot)$ is a real-valued function
assigning weights to hyperedges to indicate the relative importance of each
high-order relationship captured by the hyperedge. The vertex set $V$ is made by
concatenating sets of objects of different types (users, social groups, geo-tags,
tags, images). Let $m = |V|$ and $n = |E|$. The incidence matrix
$\mathbf{H} \in \mathbb{R}^{m\times n}$ has elements $H(v,e) = 1$ if $v \in e$ and 0
otherwise. The following degrees are defined:
$\delta(v) = \sum_{e \in E} w(e)H(v,e)$ and $d(e) = \sum_{v \in V} H(v,e)$,
which appear in the main diagonal of the vertex degree diagonal matrix
$\mathbf{D}_v \in \mathbb{R}^{m\times m}$ and the hyperedge degree diagonal matrix
$\mathbf{D}_e \in \mathbb{R}^{n\times n}$, respectively. Let
$\mathbf{W} \in \mathbb{R}^{n\times n}$ be the diagonal matrix having the hyperedge
weights $w(e)$, $e \in E$ in its main diagonal. To measure the degree of similarity
of any pair of vertices, one has to compute matrix
$\mathbf{A} \in \mathbb{R}^{m\times m}$, see Eq. (5.4).
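
The quantities defined above can be assembled in a few lines. The sketch assumes that Eq. (5.4), which is not reproduced in this section, has the form A = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, i.e., the expression that appears inside (B.1); the toy incidence matrix and weights are illustrative.

```python
import numpy as np


def hypergraph_affinity(H, w):
    """Return vertex degrees, hyperedge degrees, and the affinity matrix A.

    Assumes A = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} (taken here as the form of Eq. (5.4)).
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    delta_v = H @ w                 # delta(v) = sum_e w(e) H(v, e)
    d_e = H.sum(axis=0)             # d(e) = sum_v H(v, e)
    dv_isqrt = 1.0 / np.sqrt(delta_v)
    A = (dv_isqrt[:, None] * H) @ np.diag(w / d_e) @ (H.T * dv_isqrt[None, :])
    return delta_v, d_e, A


# Toy hypergraph with m = 4 vertices and n = 3 hyperedges.
H_toy = [[1, 0, 1],
         [1, 1, 0],
         [0, 1, 1],
         [1, 0, 0]]
w_toy = [0.5, 0.3, 0.2]
delta_v, d_e, A = hypergraph_affinity(H_toy, w_toy)
```

Under this form, A is symmetric and the vector of square-rooted vertex degrees is an eigenvector of A with eigenvalue 1, which gives a quick consistency check on an implementation.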

Let $\mathbf{L} = \mathbf{I} - \mathbf{A} \in \mathbb{R}^{m\times m}$ be Zhou's
normalized Laplacian of the hypergraph [145]. To cluster the vertices of the
hypergraph, one has to minimize Eq. (5.5) w.r.t. $\mathbf{f} \in \mathbb{R}^{m}$.
Highly connected vertices are meant to have equal values in the optimal ranking
vector $\mathbf{f}^*$ [146]. So, two images are highly likely to be very similar
if they share a number of common tags above a specified threshold. Both
$\mathbf{A}$ and $\mathbf{L}$ are frequently sparse matrices. The clustering
optimization problem (5.5) can be treated as a ranking problem by introducing an
$\ell_2$ regularization norm in order to force the ranking vector $\mathbf{f}$ to be
as close as possible to a query vector specified by a user [147]. Let
$\mathbf{X} = \mathbf{I} - \frac{1}{1+\vartheta}\mathbf{A} \in \mathbb{R}^{m\times m}$,
where $\vartheta$ is a regularization parameter. Given $\mathbf{W}$, the best ranking
vector minimizing Eq. (5.5) is given by Eq. (5.11).

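The next step is to update $\mathbf{W} \in \mathbb{R}^{n\times n}$ using the steepest
descent method, as proposed in [112]. Let $\mathbf{w} = (w_1, w_2, \ldots, w_n)$ be
the vector formed by the diagonal elements of $\mathbf{W}$. Meaningful constraints on
$\mathbf{w}$ are $\mathbf{1}^{\top}\mathbf{w} = 1$ and $\mathbf{w} \geq \mathbf{0}$.
Let $P(\mathbf{w}) = \mathbf{f}^{\top}\mathbf{L}\mathbf{f} + \kappa\|\mathbf{w}\|^2$,
where $\kappa$ is a positive regularization parameter. When $\mathbf{f}$ is fixed,
the optimization problem w.r.t. $\mathbf{w}$ is defined as

\[
\arg\min_{\mathbf{w}} P(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{1}^{\top}\mathbf{w} = 1 \ \text{and} \ \mathbf{w} \geq \mathbf{0}
\tag{6.1}
\]

where the abbreviation s.t. stands for subject to. The Lagrangian of (6.1) is given
by $S = P + \sum_{j} c_j G_j$, where $c_j$, $j = 1, 2, \ldots$, are the Lagrange
multipliers and the active constraints $G_j$ are given by:

\[
G_j =
\begin{cases}
\mathbf{1}^{\top}\mathbf{w} - 1 = 0 & \text{for } j = 1 \\
w_{\nu_j - 1} = 0 & \text{for } j > 1.
\end{cases}
\tag{6.2}
\]

The equality constraint $G_1$ is always active. The weights of the remaining active
constraints stay 0, where $\nu_j - 1 \in [1, n]$ is an index of a hyperedge weight. The
steepest descent rule for updating the weights,
$\mathbf{w}^{new} = \mathbf{w}^{old} - \mu\nabla S$, was studied
in detail in [112]. Here, we are interested in the derivation of fast algorithms for
computing the ranking vector by exploiting either block randomized SVD of
$\mathbf{X}$ or the conjugate gradient method to solve

\[
\mathbf{X}\mathbf{f} = \frac{\vartheta}{1+\vartheta}\,\mathbf{y}.
\tag{6.3}
\]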
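
The conjugate gradient route can be sketched on a toy problem. The sketch assumes that the system matrix is X = I - A/(1+ϑ) with the affinity matrix A in Zhou's normalized form, so that X is symmetric positive definite, and that the right-hand side is the query vector scaled by ϑ/(1+ϑ); the hypergraph, the value of ϑ, and the variable names are illustrative assumptions.

```python
import numpy as np


def conjugate_gradient(X, b, tol=1e-10, max_iter=None):
    """Solve X f = b for a symmetric positive definite matrix X."""
    f = np.zeros_like(b, dtype=float)
    r = b - X @ f            # residual
    p = r.copy()             # search direction
    rs_old = r @ r
    for _ in range(max_iter or 10 * b.size):
        if np.sqrt(rs_old) <= tol:
            break
        Xp = X @ p
        step = rs_old / (p @ Xp)
        f += step * p
        r -= step * Xp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return f


# Toy hypergraph: 4 vertices, 3 hyperedges (illustrative values only).
H = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]], dtype=float)
w = np.array([0.5, 0.3, 0.2])
dv_isqrt = 1.0 / np.sqrt(H @ w)
A = (dv_isqrt[:, None] * H) @ np.diag(w / H.sum(axis=0)) @ (H.T * dv_isqrt[None, :])

theta = 0.1                                  # regularization parameter (assumed value)
X = np.eye(4) - A / (1.0 + theta)
y = np.array([1.0, 0.0, 0.0, 0.0])           # query on the first vertex
rhs = theta / (1.0 + theta) * y
f_cg = conjugate_gradient(X, rhs)
f_direct = np.linalg.solve(X, rhs)           # reference solution
```

Each iteration needs only one matrix-vector product with X, which is what makes the method attractive when X is large and sparse.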
6.2 Block randomized SVD and conjugate gradient method for optimization

Randomized matrix approximations are performed in two stages. The first stage
comprises random sampling in order to find a lower-dimensional subspace that
captures most of the action of $\mathbf{X} \in \mathbb{R}^{m\times m}$. To do so, a
column-orthonormal matrix $\mathbf{S} \in \mathbb{R}^{m\times l}$ should be computed
such that

\[
\|\mathbf{X} - \mathbf{S}\mathbf{S}^{\top}\mathbf{X}\|_F \leq \epsilon
\tag{6.4}
\]

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. If
$\mathrm{rank}(\mathbf{X}) = k < m$, an oversampling parameter $p$ is chosen, such
that $l = k + p$. Frequently, $l$ is set equal to $2k$. An $m \times l$ matrix
$\mathbf{Q}$ whose elements stem from a Gaussian distribution with zero mean and unit
variance is essential for the randomized subspace iteration. In this case, a small
oversampling parameter $p$ equal to 5 or 10 yields accurate results [149].
The second stage of the randomized SVD algorithm consists of the approximate SVD factorization of matrix $B$,

\[ B = S^\top X \tag{6.5} \]

i.e., $B = \tilde{U} \Sigma V^\top$. The final solution for $U$ is given by multiplying the approximate $\tilde{U}$ with the basis $S$. The detailed randomized SVD via subspace iteration algorithm is summarized in Algorithm 3.

The results of the approximate SVD solution should satisfy

\[ \| X - U \Sigma V^\top \| < \epsilon. \tag{6.6} \]

Steps 2-3 of Algorithm 3 can be replaced by forming:

\[ Y = (X X^\top)^{q} X Q. \tag{6.7} \]

To construct matrix $S$, whose columns form an orthonormal basis for the range of $Y$, one may employ QR factorization. However, (6.7) is more sensitive to round-off errors and thus, when high accuracy is required, Algorithm 3 is preferable. Its superiority is grounded on the orthonormalization step it includes.
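The two stages described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `randomized_svd`, the arguments `l` (number of sampled columns) and `q` (number of subspace iterations), and the toy low-rank matrix are hypothetical choices made here, not the implementation used in the experiments.

```python
import numpy as np

def randomized_svd(X, l, q, seed=0):
    """Approximate X ~ U diag(s) Vt using an l-column sketch and q subspace iterations."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    Q = rng.standard_normal((n, l))          # Gaussian test matrix, zero mean, unit variance
    S, _ = np.linalg.qr(X @ Q)               # orthonormal basis of the initial sketch
    for _ in range(q):
        S_tilde, _ = np.linalg.qr(X.T @ S)   # re-orthonormalize after applying X^T
        S, _ = np.linalg.qr(X @ S_tilde)     # re-orthonormalize after applying X
    B = S.T @ X                              # small l x n matrix, cf. (6.5)
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    return S @ U_tilde, s, Vt                # lift the left factor back with the basis S

# Toy example (assumed data): an exactly rank-k matrix, sampled with l = 2k columns.
rng = np.random.default_rng(1)
m, k = 200, 10
X = rng.standard_normal((m, k)) @ rng.standard_normal((k, m))
U, s, Vt = randomized_svd(X, l=2 * k, q=2)
rel_err = np.linalg.norm(X - (U * s) @ Vt) / np.linalg.norm(X)
```

The repeated QR factorizations inside the loop are the orthonormalization step that makes the subspace iteration less sensitive to round-off than forming the power in (6.7) directly.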

To apply randomized SVD to X, X has to be low-rank. In our case, X defined 


as X = I- Tio is full-rank. To exploit the benefits of randomized SVD, we 
partition X, as suggested in [157], as follows

\[ X = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix} \tag{6.8} \]
Algorithm 3: Randomized Singular Value Decomposition for low-rank matrix approximation via subspace iteration
Inputs: An $m \times m$ matrix $X$, an integer $q$, and an integer $l$.
Output: Approximate factorization $X \approx U \Sigma V^\top$, where $U$ and $V$ are orthonormal and $\Sigma$ is non-negative and diagonal, containing the singular values of $X$.

1: Create an $m \times l$ matrix $Q$ whose elements are independent identically distributed Gaussian random variables with zero mean and unit variance.
2: Form $Y_0 = X Q$ and compute its QR factorization $Y_0 = S_0 R_0$.
3: for $i = 1, 2, \ldots, q$ do
      Form $\tilde{Y}_i = X^\top S_{i-1}$ and compute its QR factorization $\tilde{Y}_i = \tilde{S}_i \tilde{R}_i$.
      Form $Y_i = X \tilde{S}_i$ and compute its QR factorization $Y_i = S_i R_i$.
   end for
4: $S = S_q$
5: $B = S^\top X$
6: Compute the SVD of the matrix $B = \tilde{U} \Sigma V^\top$.
7: Set $U = S \tilde{U}$.


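where $X_{11}$ and $X_{22}$ are low-rank submatrices. The inverse of matrix $X$ is given by

\[ \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}^{-1} = \begin{bmatrix} Z_1^{-1} & -Z_1^{-1} X_{12} X_{22}^{-1} \\ -Z_2^{-1} X_{21} X_{11}^{-1} & Z_2^{-1} \end{bmatrix} \tag{6.9} \]

where

\[ Z_1 = X_{11} - X_{12} X_{22}^{-1} X_{21} \tag{6.10} \]
\[ Z_2 = X_{22} - X_{21} X_{11}^{-1} X_{12} \tag{6.11} \]

To compute $Z_1^{-1}$ and $Z_2^{-1}$ as well as $X_{11}^{-1}$ and $X_{22}^{-1}$, randomized SVD is applied.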
To further reduce the time requirements of the inversions, nested tessellations of the submatrices created along the main diagonal are employed until the minimum rank of the created submatrices reaches 50. Regarding the second update of X, after having completed the alternating update of W, the minimum rank of the diagonal submatrices of X is increased to 500. In this iteration, matrix X gets even sparser, because many of its elements are set to zero after steepest descent. Thus, block randomized SVD enables further computational time reduction in solving (5.11).
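The block inversion in (6.9)-(6.11) can be checked numerically with a short NumPy sketch. It is given under clear assumptions: `block_inverse` and `split` are hypothetical names, a single level of partitioning is shown rather than the nested tessellation, and `np.linalg.inv` stands in for the randomized-SVD-based inversions of the submatrices.

```python
import numpy as np

def block_inverse(X, split):
    """Invert X through its 2 x 2 block partition and the two Schur complements."""
    X11, X12 = X[:split, :split], X[:split, split:]
    X21, X22 = X[split:, :split], X[split:, split:]
    X11_inv = np.linalg.inv(X11)   # stand-in for a randomized-SVD-based inverse
    X22_inv = np.linalg.inv(X22)
    Z1_inv = np.linalg.inv(X11 - X12 @ X22_inv @ X21)   # inverse of Z1, cf. (6.10)
    Z2_inv = np.linalg.inv(X22 - X21 @ X11_inv @ X12)   # inverse of Z2, cf. (6.11)
    top = np.hstack([Z1_inv, -Z1_inv @ X12 @ X22_inv])
    bottom = np.hstack([-Z2_inv @ X21 @ X11_inv, Z2_inv])
    return np.vstack([top, bottom])

# Toy example (assumed data): identity minus a scaled row-stochastic matrix,
# which is full-rank and well conditioned.
rng = np.random.default_rng(0)
R = rng.random((6, 6))
A = R / R.sum(axis=1, keepdims=True)
X = np.eye(6) - 0.5 * A
X_inv = block_inverse(X, 3)
```

Applying the same function recursively to the diagonal blocks would mimic the nested partitioning described above.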
Having derived the approximate SVD of X, i.e., $X = U \Sigma V^\top$, the inverse of X is obtained as

\[ X^{-1} = V \Sigma^{-1} U^\top \tag{6.12} \]

with $\Sigma^{-1} = \operatorname{diag}\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_m}\right)$, where $\Sigma$ is non-negative and diagonal, and $U$, $V$ are orthonormal.

In the second approach, the conjugate gradient method is employed to solve the linear system (6.3), which has been found to yield further time reduction. Iteratively, the ranking vector $f_i$, the residual $r_i$, and the search direction $p_i$ are updated starting from an initial $f_0$. The initial residual vector $r_0$ is given by the right-hand side of (6.3) minus $X f_0$, and the search direction is initialized as $p_0 = r_0$. For $i \geq 1$, the updated vectors are given by [159]:

\[ f_i = f_{i-1} + \alpha_i p_{i-1} \tag{6.13} \]
\[ r_i = r_{i-1} - \alpha_i X p_{i-1} \tag{6.14} \]
\[ p_i = r_i + \beta_i p_{i-1} \tag{6.15} \]

where

\[ \alpha_i = \frac{r_{i-1}^\top r_{i-1}}{p_{i-1}^\top X p_{i-1}} \tag{6.16} \]

and

\[ \beta_i = \frac{r_i^\top r_i}{r_{i-1}^\top r_{i-1}} \]

6.3. Dataset description and experimental evaluation


The same dataset used in is employed here, retaining the same exper- 
imental setup. It contains a large amount of Greek places of interest along with 
valuable information related to them. In particular, geotagged photos, both indoor 
and outdoor, are accompanied by auxiliary information, such as id, title, owner, latitude, longitude, tags, and views. Only images with many views that were captured by users who frequently upload images are retained. Images with many views are assumed to depict landmarks worth seeing, which have attracted the interest of active users with dense social relations (e.g., possessing many friends, participating in many social groups). The aforementioned information was crawled from social media sharing platforms, e.g., Flickr. Our interest is limited to groups that have at least 5 members (i.e., image owners). The specific cardinalities are summarized in Table 6.1. A 64-bit operating system with an Intel(R) Core(TM) i7-4771 CPU at 3.5 GHz and 16 GB RAM was used in the experiments conducted.

For each picture, a set of tags was created and a vocabulary was generated, including the number of times each tag appeared. Hierarchical clustering was applied afterwards. Detailed information about the dataset, the pre-processing procedure, and hypergraph construction can be found in [112].

Table 6.1: Dataset objects, notations, and counts.

Object        Notation   Count
Images        Im         1292
Users         U          440
User Groups   Gr         1644
Geo-tags      Geo        125
Tags          Ta         2366


The query vector y is initialized by setting the entries corresponding to the test image Im and its owner U to 1. The tags Ta connected to this image are set equal to A(Im, Ta). The objects corresponding to Gr and Geo associated with the image owner U are set equal to A(U, Gr) and A(U, Geo), respectively. The query vector y has a length of 5867 elements. During testing, the tags contained in the test set were not included in the training procedure. To allow comparisons with the previous work [112], the same metric has been adopted, i.e., the $F_1$ measure. The $F_1$ measure is calculated at various ranking positions and four of them are included in the detailed comparison (i.e., F1@1, F1@2, F1@5, F1@10). Here, our main goal is to demonstrate computational time reduction while maintaining the same performance as the methods in [158], [112].
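For readers unfamiliar with the metric, the following sketch shows how an $F_1$ measure at ranking position k is commonly computed from a ranked list of tags. It assumes the standard definition (harmonic mean of precision and recall over the top-k items); the exact evaluation protocol is the one of [112] and is not reproduced here, and the tag names are hypothetical.

```python
def f1_at_k(ranked_tags, relevant_tags, k):
    """Harmonic mean of precision@k and recall@k for one query."""
    top_k = ranked_tags[:k]
    hits = sum(1 for tag in top_k if tag in relevant_tags)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant_tags)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ranking returned for one test image and its ground-truth tags.
ranked = ["acropolis", "athens", "sea", "museum", "sunset"]
relevant = {"acropolis", "museum", "greece"}
scores = {k: f1_at_k(ranked, relevant, k) for k in (1, 2, 5)}
```

Averaging such per-query scores over all test images gives one value per ranking position, which is the form in which the results are tabulated.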

ITH-HWEG stands for the adaptive weight estimation method with steepest descent using a fixed adaptation step [112]. Let BR-ITH-HWEG refer to the adaptive weight estimation, using steepest descent and exploiting the block randomized SVD via subspace iteration in (5.11). Similarly, CG-ITH-HWEG refers to the same adaptive weight estimation when the conjugate gradient method is employed. Let ITH denote the method proposed in [158]. We have employed in ITH either block randomized SVD, yielding BR-ITH, or the conjugate gradient method, obtaining CG-ITH. The main difference between ITH-HWEG and ITH lies in the fact that in ITH the weights are fixed and the ranking vector is calculated only once.

Figure 6.1: Total time requirements for each method (curves: ITH-HWEG, BR-ITH-HWEG, CG-ITH-HWEG; horizontal axis: No. of images).

Table 6.2: Time requirements for the optimization problems (5.11) and (6.3) in sec.

Methods              ITH [158]   ITH-HWEG
Original             4267        8530
Block randomized     3126        4606
Conjugate gradient   355         727

The basic algorithm ITH requires 4267 sec in order to find the best ranking vector, as can be seen in Table 6.2. That is, the ITH algorithm requires about 3.3 sec/image. By employing block randomized SVD to facilitate matrix inversion, the computational time of BR-ITH drops to about 3126 sec. That is, BR-ITH requires about 2.42 sec/image, yielding a time reduction of approximately 27% w.r.t. ITH. The $F_1$ measure obtained by BR-ITH is practically identical to that of ITH for all ranking positions (differences of at most 0.001), as shown in Table 6.3. When the conjugate gradient method is employed within ITH, the computational time reduces to 355 sec, yielding a gain of about 92% w.r.t. ITH. That is, CG-ITH requires about 0.27


sec/image. The resulting $F_1$ measure remains essentially the same as that of ITH.

In ITH-HWEG, we have to optimize not only w.r.t. f keeping w fixed, but also w.r.t. w keeping f fixed. The basic ITH-HWEG takes 8530 sec to derive the optimal ranking vector, i.e., about twice the computational time of ITH. By employing the proposed block randomized SVD, the time drops to 4606 sec, which amounts to a 46% reduction in time w.r.t. ITH-HWEG. That is, BR-ITH-HWEG induces a computational cost of about 3.56 sec/image. The $F_1$ measure achieved by BR-ITH-HWEG at certain ranking positions is slightly better than that of the basic algorithm ITH-HWEG (see F1@5, F1@10 in Table 6.3). By employing the conjugate gradient method within ITH-HWEG, the computational time drops to 727 sec, i.e., to approximately 8.5% of the computational time required by ITH-HWEG (a reduction of about 91.5%). This reduction has a negligible effect on the $F_1$ measure at the four ranking positions listed in Table 6.3 (differences of at most 0.002). Thus, the conjugate gradient method can be exploited for optimizing f given w in large datasets, which might include millions of images, and in real-time applications for a single query image (0.56 sec/image). Figure 6.1 depicts the overall computational time needed by the algorithms for various numbers of images. ITH, BR-ITH, and CG-ITH optimize f given a fixed w. ITH-HWEG, BR-ITH-HWEG, and CG-ITH-HWEG run an alternating minimization of f given w and w given f. It is seen that including the second optimization step, i.e., optimizing w given f, in the loop improves the $F_1$ measure, but at the expense of a larger computational time than that of ITH. In the latter case, the computational time increases almost linearly with the number of images, which could be prohibitive.
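The per-image costs and relative reductions quoted above follow directly from the total running times and the number of images. A minimal sketch of this arithmetic is given below; the dictionary names are arbitrary, while the numbers are the total times and the image count reported in this section.

```python
n_images = 1292  # number of images in the dataset

# Total running times in sec, as reported in the text.
times = {
    "ITH": 4267, "BR-ITH": 3126, "CG-ITH": 355,
    "ITH-HWEG": 8530, "BR-ITH-HWEG": 4606, "CG-ITH-HWEG": 727,
}

# Each accelerated variant is compared against its own baseline.
baseline = {
    "BR-ITH": "ITH", "CG-ITH": "ITH",
    "BR-ITH-HWEG": "ITH-HWEG", "CG-ITH-HWEG": "ITH-HWEG",
}

per_image = {name: t / n_images for name, t in times.items()}
reduction = {name: 100 * (1 - times[name] / times[ref]) for name, ref in baseline.items()}

for name in times:
    note = f", {reduction[name]:.1f}% less than {baseline[name]}" if name in baseline else ""
    print(f"{name}: {per_image[name]:.2f} sec/image{note}")
```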
Table 6.3: $F_1$ measure at various ranking positions for various approaches.

Methods                    F1@1    F1@2    F1@5    F1@10
ITH [158]                  0.312   0.457   0.530   0.445
BR-ITH (Proposed)          0.312   0.456   0.531   0.444
CG-ITH (Proposed)          0.312   0.456   0.530   0.445
ITH-HWEG                   0.425   0.682   0.753   0.558
BR-ITH-HWEG (Proposed)     0.425   0.674   0.756   0.562
CG-ITH-HWEG (Proposed)     0.427   0.680   0.752   0.557

Chapter 7


A contrastive learning framework 
with style transfer for detecting 
computer generated images 


7.1 Introduction 


Artificial intelligence plays a critical role in the identification and mitigation of disinformation, which poses a significant threat to democratic principles on a global scale. With the exponential expansion of social media content and the rapid advancements in image processing and machine learning, the fight against disinformation has emerged as a top priority. The continuous and rapid evolution of multimedia technologies, coupled with the progress in the development of tools for creating computer-generated images (CGIs), has resulted in CGIs attaining a level of realism that renders them indistinguishable from natural images (NIs) to the unaided human eye. Distinguishing CGIs from NIs is especially pertinent for addressing the practical challenges posed by deepfakes. Deepfakes, being advanced forms of computer-generated imagery, underscore the critical need for robust identification methods. In practical terms, the ability to differentiate between authentic and manipulated content has become indispensable for countering the proliferation of deepfakes across diverse domains. For instance, within media and entertainment, where deepfakes can deceive audiences by depicting fabricated scenarios or altering the appearance of individuals, the capability to discern these manipulations preserves the trust between content creators and their viewers. Moreover, in a forensic analysis and cybersecurity context, techniques that identify alterations play a pivotal role in verifying the authenticity of visual evidence and mitigating the potential for false information dissemination. This practical application of distinguishing CGIs from NIs extends its significance to various industries reliant on visual representation, fortifying the credibility and reliability of presented content, which is crucial in an era where deepfakes challenge the authenticity of visual information.

A plethora of image processing techniques and 3D image rendering software packages have contributed to the creation of such sophisticated content. Various high-quality galleries of CGIs are available, such as the Autodesk A360 rendering gallery [160], the Artlantis gallery [161], the VRay gallery [162], and the Corona one [163]. Amid the multimedia forgery outbreak, realistic CGIs have been added to the arsenal of fraudsters. As a countermeasure, there is an urgent need to deploy algorithms that can discriminate accurately and reliably between CGIs and NIs. Thus, multimedia forensics draws the community's attention to methods that counter all kinds of attacks within image forensics [164], including approaches for universal image forensics [165], copy-move forgery detection [166], splice detection [167], and face anti-spoofing detection [168]. Many approaches have also been introduced in the context of image forgery detection that leverage gradient-based illumination [169], decision fusion [170], pairwise relations [171], and transformed spaces based on image illuminant maps [172].

Digital forensics can be useful in determining the difference between NIs and
CGIs. A scenario where CGIs can cause harm is through image manipulation for
political propaganda, making authenticity validation a crucial aspect. Another 
challenging scenario is verifying the authenticity of images, particularly when of- 
fenders attempt to manipulate child pornography photos digitally so as to appear 
like CGIs. In all circumstances, attesting to the validity of the photographs is a 
key challenge in forensics. 

Distinguishing CGIs from NIs can be treated as a classification task. Until recently, many approaches proposed hand-crafted features [173], [174], [175], [176], [177] to cope with the aforementioned classification problem, while the majority of recent state-of-the-art methods utilize deep neural networks (DNNs), e.g., [178], [179], [180], [181], [182], [183]. The latter methods tend to be more effective in discovering hidden patterns and structures in images. On top of that, the generalization ability of DNN methods allows for automation, which is crucial in real-life applications, even when large training datasets are unavailable.

Here, to take full advantage of DNN methods in terms of learning complex data representations and automatically deriving highly accurate decisions, an end-to-end convolutional neural network (CNN)-based framework is proposed to discriminate between CGIs and NIs. To the best of the authors' knowledge, this is the first attempt to demonstrate the potential of supervised contrastive learning in the context of discrimination between CGIs and NIs. The proposed framework consists of two stages. First, a CNN is proposed, which is based on the ResNet-18 architecture and employs the supervised contrastive (SupCon) loss presented in [185]. On top of this, and apart from the data augmentation, a complementary style transfer module is introduced to enhance training by supplying the network with negative samples in addition to those of the original dataset. Handcrafted image augmentations (e.g., cropping, blurring, flipping) provide insufficient variation in visual features, limiting the performance of contrastive learning techniques that employ them. The style transfer module creates synthetic images (e.g., deepfakes), but it can also add artificial visual features to real images. The core idea behind integrating style transfer is to enable more accurate training by using only the original dataset, even when insufficient training samples exist. The derived results demonstrate that style transfer improves the accuracy of contrastive learning. During the second stage, the trained model is fed to a linear classifier for further training using the cross-entropy loss.

Contrastive learning forces samples of the same class to stay close to each other, while samples that belong to different classes are pushed far apart. In contrast to self-supervised contrastive learning, supervised contrastive learning leverages the label information, providing many positive samples to the network. Positive samples are fed into the classifier using data augmentation procedures. Moreover, stochastic weight averaging (SWA) is employed on the network outputs after each stage to improve robustness. The style transfer module operates in real time and takes advantage of the NIs that constitute the positive class. It introduces a progressive attentional manifold alignment. Thus, it can dynamically reposition the style features of some arbitrarily chosen CGIs by repeated attention operations to align the content manifold to the style manifold. With the contribution of the style transfer module, the training procedure is enriched with additional incoming samples, allowing models trained on datasets with a limited number of training samples to be more robust and effective. Overall, the proposed framework aims at identifying and mitigating deceptive visuals, fortifying the trustworthiness and reliability of visual content across diverse applications and sectors.

The experimental results are reported on the public benchmark DSTok [186], Rahmouni [183], and LSCGB datasets, demonstrating that the proposed framework accurately distinguishes CGIs and NIs, outperforming the state-of-the-art approaches and motivating further research. On top of that, the generalization ability of the proposed framework trained on the DSTok dataset is tested on the publicly available Rahmouni dataset. Moreover, CoStNet is trained on the most recent state-of-the-art LSCGB dataset, tested on the challenging DSTok dataset, and compared against state-of-the-art approaches. The impact of various parameters during training is assessed. An extensive evaluation of the proposed approach is performed on test samples corrupted by salt-and-pepper or Gaussian noise to attest to its ability to deliver accurate results under various conditions. An ablation study is undertaken to examine the impact of the style transfer module when insufficient training samples are available. Furthermore, hypothesis testing is performed to assess whether the improvements in detection accuracy delivered by the proposed framework against state-of-the-art approaches are statistically significant.

The main contributions of the Chapter are as follows:

• A novel CNN-based framework, abbreviated as CoStNet, is designed to discriminate between CGIs and NIs. To the best of the authors' knowledge, this is the first attempt to conduct such discrimination based on supervised contrastive learning and style transfer in the benchmark DSTok, Rahmouni, and LSCGB datasets.

• A complementary style transfer module, which operates in real time, is employed to increase the number of training CGIs even when a limited number of training samples is available, thus enhancing the training procedure.

• CoStNet achieves state-of-the-art accuracies in the benchmark DSTok, Rahmouni, and LSCGB datasets.

• The generalization capability of CoStNet, initially trained on the LSCGB dataset, is evaluated through testing on the DSTok dataset. Additionally, CoStNet is trained on the DSTok dataset and subsequently tested on the Rahmouni dataset to assess its broader applicability.

• The proposed framework is robust against salt-and-pepper and Gaussian noise at various corruption levels, including high ones.

• Multiple tests are conducted to empirically demonstrate that CoStNet has low sensitivity to modifications of the training parameters, such as the number of training epochs and the batch size.

• An ablation study is performed to assess the impact of the style transfer module when limited training samples are available.

• Hypothesis testing confirms that the improvements in detection accuracy between CoStNet and methods reported in the literature are statistically significant.


In summary, the proposed CoStNet framework is a novel CNN-based architecture that utilizes real-time style transfer and supervised contrastive learning to discriminate CGIs from NIs. CoStNet is demonstrated to accurately discriminate CGIs and NIs across benchmark datasets, namely the DSTok, the Rahmouni, and the LSCGB datasets. The incorporation of the style transfer module allows for the augmentation of CGIs based on existing image content, thus offering additional training CGIs. By doing so, the scarcity of CGI training samples, which is prevalent in real-world forensic scenarios, is addressed. CoStNet's robust performance under various noise levels and parameter settings, as well as its generalization ability in testing, further underscores its versatility and effectiveness under diverse conditions. CoStNet's resilience is also evaluated in scenarios with limited training data through an ablation study, demonstrating its capabilities in CGI discrimination.


7.2 Related work 


The advances in multimedia forensics, on the one hand, and the sophisticated 
software, which enables the ever-increasing creation of realistic CGIs, on the other 
hand, have challenged scientists to develop new methods to counter fraudulent
manipulations arising from such technological advances. A transfer learning and 
convolution block attention module, which considers both the shallow content 
features and the deep semantic features of the image, was introduced in to 
tackle the problem of distinguishing NIs from CGIs. Parallel to the evolution
of algorithms, ever-challenging datasets were released. In [187], the new large- 
scale CG benchmark dataset (LSCGB) is introduced, consisting of 71168 CG and 
71168 natural annotated images. The authors also presented a baseline texture- 
aware network to address the discrimination problem on their benchmark dataset. 
A novel two-branch network was proposed to tackle the generalization problem in 
the blind detection of CGIs by introducing different initializations in the first layer 
so that more diverse features were extracted [179]. However, no prior knowledge of 
new distributions was used to develop a rigorous formulation. In [181], color and
texture characteristics of local patches were integrated within a dual-input CNN 
framework, and a directed acyclic graph recurrent neural network was employed 
to model the spatial dependence of local patterns. 

A statistical model for NIs was proposed in built upon a wavelet-like de- 
composition. Higher-order wavelet statistics showed substantial differences that 
made it possible to distinguish between CGIs and NIs. In , a geometry-based 
model was proposed that utilized the physical characteristics of CGIs and NIs 
in the classification process. Local patches of the image intensity function were 
employed to form a patch distribution, which enabled, in combination with the ge- 
ometry model, to uncover the distinctive physical characteristics of NIs and CGIs. 
The statistical characteristics of local edge patches were examined in [177], and a
visual language was created to handle the discrimination between CGIs and NIs. 
In [189], a technique based on sensor pattern noise was developed that used three
high-pass filters to filter out low-frequency signals. A five-layer CNN was utilized
to classify the input image patches, and a majority vote scheme was employed to 
extend the classification results to the full-sized images. A CNN was presented 
in [190], promoting the so-called local-to-global strategy. Forensic decisions were 
derived from local patches, and a global decision based on majority voting on 
the full-sized images was implemented. In [183], a CNN with a custom pooling 
layer was proposed. Local estimates of the class probabilities were employed to 
predict the full-size image label. A deep convolutional recurrent attention model
was proposed to classify CGIs and NIs, employing a local-to-global strategy [191].
The model was trained on image patches, and the full-sized images were classified using the
simple majority vote rule. An attention-based dual-branch CNN with fused color
components was proposed in [192]. There, raw RGB components and their noisy 
versions were given as input to the network, while the attention-based model op- 
timized the output features from the two branches in combination to perform 
detection. A method for distinguishing between CGIs and NIs based on DNN and 
transfer learning was presented in [182]. A qualitative examination of ResNet-50 
bottleneck characteristics for CGI detection was performed. Comprehensive re- 
views of various methods for discriminating between CGIs and NIs can be found 
in [193] and [194]. 


7.3 Proposed framework 


7.3.1 Framework overview 


The proposed framework for detecting and discriminating CGIs from NIs comprises two modules, as shown in Fig. 7.1. Input CGIs are passed through the first module, namely the style transfer module, which generates additional CGIs that are added to the training set, thus enriching the training procedure. When training samples are scarce, style transfer, given a CGI as the basis for style semantics, can create as many CGIs as there are NIs, whose content semantics remain unaltered. The style transfer module leverages a pre-trained VGG network [144] to encode the content image and a separate style image, yielding distinct features for each one of them, so that the content image can be imbued with the stylistic patterns of the style image. Adhering to the framework introduced in [195], these features are transformed by the attentional manifold alignment (AMA) block to achieve stylization. This block encompasses a channel alignment module, an attention module, and a spatial interpolation module. Once processed through three iterations of the AMA block, the aligned content feature is fed into the decoder, which generates the stylized image. In terms of practical implementation, the role of the style transfer module within CoStNet is pivotal. This module operates by transferring the visual characteristics of one image (e.g., texture, color, and style) onto another while preserving the content of the latter. Particularly in scenarios where the availability of CGIs is limited, the style transfer module enhances the diversity of CGIs by synthesizing new images based on the content of NIs. In such cases, the augmentation enriches the training data, thereby improving the robustness of the framework against variations in CGI appearance. More details on style transfer learning are given in Section 7.3.2. An example of a CGI generated from an NI through the style transfer module is depicted in Fig. 7.2. Details for the style transfer module can be found in Fig. 7.3.
Figure 7.1: Architecture of the proposed CoStNet.

Figure 7.2: On the left, a natural image is depicted. On the right, a computer-generated image is shown that was generated by the style transfer module.


Having increased the number of CGIs by employing the style transfer module, 
the total amount of input images is passed to the second module, which imple- 
ments the learning procedure. It consists of two distinct stages. The first stage 
comprises an encoder with a CNN architecture, namely the ResNet-18 architec- 
ture, employing the supervised contrastive loss. Details can be found in Section 
Upon receiving an input batch of the enriched data, random data augmentation is applied twice to yield two copies of the batch, each representing a different view of the data. Both copies are then forwarded through the encoder network, generating a normalized embedding. During training, this representation is further passed through a projection network, which is discarded at inference. The outputs of the projection network are used to compute the supervised contrastive loss, as proposed in [185]. For classification purposes, the output of the first stage is fed into an encoder network identical to that of the first stage and then to a linear classifier, which is trained on top of the fixed representations using the cross-entropy loss, allowing the trained model to be employed for classification tasks. After each training stage, SWA is applied to improve the model's generalization and stability [196].
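The supervised contrastive loss computed on the projection outputs can be illustrated with a small NumPy sketch. It follows the commonly used formulation of the SupCon loss, in which every other sample of the same class in the batch acts as a positive for an anchor. The function name, the temperature value of 0.1, and the toy embeddings are assumptions made for illustration and are not taken from the experimental setup.

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss for a batch of projections z with shape (n, d)."""
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2-normalize the projections
    sim = z @ z.T / temperature                           # scaled pairwise similarities
    not_self = ~np.eye(len(labels), dtype=bool)           # exclude each anchor itself
    sim = sim - sim.max(axis=1, keepdims=True)            # shift for numerical stability
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0                                     # anchors with at least one positive
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / n_pos[valid]
    return -mean_log_prob_pos.mean()

# Toy batch (assumed data): two embeddings per class, e.g. two views of one image each.
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matched = supcon_loss(z, [0, 0, 1, 1])      # labels agree with the geometry
loss_mismatched = supcon_loss(z, [0, 1, 0, 1])   # labels contradict the geometry
```

The loss is small when embeddings of the same class lie close together and grows when samples carrying the same label are far apart, which is the behavior the first training stage exploits.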
Figure 7.3: Components of style transfer module.


7.3.2 Style transfer learning 


CoStNet integrates a style transfer module to render a content image with style patterns from a reference image. By doing so, the training samples are augmented, and the algorithm performance improves even when few training samples are available. A state-of-the-art arbitrary style transfer framework, called Progressive Attentional Manifold Alignment [195], is employed, which gradually aligns the content and style manifolds using an attention mechanism for consistent stylization across semantic regions. The loss function of the progressive manifold alignment approach is accumulated over several stages. Let $L_{ss}$ denote the content loss, while $L_r$, $L_m$, and $L_h$ denote the style losses. The losses of the three stages are combined as a weighted sum [195]:

\[ L = \sum_{i=1}^{3} \left( \lambda_{ss}^{i} L_{ss}^{i} + \lambda_{r}^{i} L_{r}^{i} + \lambda_{m}^{i} L_{m}^{i} + \lambda_{h}^{i} L_{h}^{i} \right) + L_{ae} \tag{7.1} \]

where $\lambda_{\ell}^{i}$ refers to the weight parameter for $L_{\ell}$, $\ell \in \{ss, r, m, h\}$, in the $i$th stage of the procedure as described in [195], and $L_{ae}$ stands for the autoencoder loss.

The content loss $L_{ss}$ employs the $\ell_1$ norm between the self-similarity matrices of the content feature $F_c$ and the VGG feature of the stylized image $F_{cs}$ [144]. Let also $H_x$ and $W_x$ denote the height and the width of the feature $F_x$ with $x \in \{c, s\}$, respectively. The loss $L_{ss}$ is given by

\[ L_{ss} = \frac{1}{H_c W_c} \sum_{i,j} \left| \frac{D^{c}_{ij}}{\sum_{i} D^{c}_{ij}} - \frac{D^{cs}_{ij}}{\sum_{i} D^{cs}_{ij}} \right| \tag{7.2} \]

where $D^{c} = [D^{c}_{ij}]$ and $D^{cs} = [D^{cs}_{ij}]$ are the pairwise cosine distance matrices of the content features $F_c$ and the VGG features of the stylized image $F_{cs}$, respectively. The cosine distance matrix is defined as 1 minus the cosine similarity, computed for $F_c$ and $F_{cs}$.
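A minimal NumPy sketch of the self-similarity content loss is given below. It assumes that each feature map has been flattened to an array of shape (H·W, C), one feature vector per spatial position, and that the distance matrices are normalized by their column sums; both conventions, as well as the function names, are assumptions made here for illustration.

```python
import numpy as np

def cosine_distance_matrix(A, B):
    """Pairwise cosine distances (1 - cosine similarity) between the rows of A and B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T

def self_similarity_loss(F_c, F_cs):
    """l1 difference between the normalized self-similarity matrices of F_c and F_cs."""
    D_c = cosine_distance_matrix(F_c, F_c)
    D_cs = cosine_distance_matrix(F_cs, F_cs)
    D_c = D_c / D_c.sum(axis=0, keepdims=True)     # normalize each column by its sum
    D_cs = D_cs / D_cs.sum(axis=0, keepdims=True)
    return np.abs(D_c - D_cs).sum() / F_c.shape[0]  # divide by the number of positions
```

Because only pairwise cosine distances enter the loss, rescaling the stylized feature leaves it unchanged; the loss penalizes changes in the relative arrangement of the feature vectors, which is why it serves as a content-preserving term.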
Let $L_r$ denote the relaxed earth mover distance [198], [199] to align the content manifold to the style manifold:

\[ L_r = \max \left( \frac{1}{H_s W_s} \sum_{i} \min_{j} C_{ij}, \; \frac{1}{H_c W_c} \sum_{j} \min_{i} C_{ij} \right) \tag{7.3} \]

where $C_{ij}$ denotes the pairwise cosine distance matrix between $F_{cs}$ and $F_s$. The statistic of the style feature is represented by the subscript $s$, while the statistic of the VGG feature of the stylization result is represented by the subscript $cs$.

In order to regularize the magnitude of features, the moment matching loss was employed [195]:

\[ L_m = \| \mu_{cs} - \mu_s \|_1 + \| \Sigma_{cs} - \Sigma_s \|_1 \tag{7.4} \]

where $\mu$ and $\Sigma$ denote the mean vector and the covariance matrix of the feature vectors.

Let $L_h$ be the differentiable color histogram loss introduced in [200]:

\[ L_h = \frac{1}{\sqrt{2}} \left\| H_s^{1/2} - H_{cs}^{1/2} \right\|_2 \tag{7.5} \]

where $H$ refers to the color histogram feature and $H^{1/2}$ denotes the element-wise square root. The color histogram feature is employed to control the distribution of colors in the generated images. The color histogram loss function encourages the generated images to match a specified color histogram, which is a representation of the distribution of colors in an image.
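The three style losses can be sketched compactly in NumPy. The sketch assumes that features are flattened to arrays of shape (H·W, C), that the $\ell_1$ norms are taken entrywise, and that the histograms are non-negative vectors summing to one; the function names are hypothetical, and the averaging conventions may differ from those of the implementation in [195].

```python
import numpy as np

def relaxed_emd(F_cs, F_s):
    """Relaxed earth mover distance between two sets of feature vectors."""
    a = F_cs / np.linalg.norm(F_cs, axis=1, keepdims=True)
    b = F_s / np.linalg.norm(F_s, axis=1, keepdims=True)
    C = 1.0 - a @ b.T                       # pairwise cosine distances
    # Average nearest-neighbor distance in both directions, then take the larger one.
    return max(C.min(axis=1).mean(), C.min(axis=0).mean())

def moment_matching(F_cs, F_s):
    """Entrywise l1 distance between the means and the covariances of the features."""
    mu_cs, mu_s = F_cs.mean(axis=0), F_s.mean(axis=0)
    cov_cs = np.cov(F_cs, rowvar=False)     # rows are observations, columns are channels
    cov_s = np.cov(F_s, rowvar=False)
    return np.abs(mu_cs - mu_s).sum() + np.abs(cov_cs - cov_s).sum()

def histogram_loss(H_s, H_cs):
    """Hellinger-type distance between two normalized color histograms."""
    return np.linalg.norm(np.sqrt(H_s) - np.sqrt(H_cs)) / np.sqrt(2.0)
```

With normalized histograms, the factor $1/\sqrt{2}$ bounds the histogram loss by 1, the value attained when the two histograms do not overlap at all.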
Moreover, an autoencoder loss $L_{ae}$ is used to preserve the shared space during manifold alignment. Let $I_{rc}$ and $I_{rs}$ denote the content and style images reconstructed from the encoded features. The loss is given by [195]:

\[ L_{ae} = \lambda_{ae} \left( \| I_{rc} - I_c \|_2 + \| I_{rs} - I_s \|_2 \right) + \sum_{i} \left( \| \phi_i(I_{rc}) - \phi_i(I_c) \|_2 + \| \phi_i(I_{rs}) - \phi_i(I_s) \|_2 \right) \tag{7.6} \]

where $\lambda_{ae}$ denotes a weight parameter, which is kept fixed. $\phi_i(I)$ refers to the ReLU$i$_1 layer VGG feature of image $I$, where ReLU$i$_$j$ denotes the result of conv$i$_$j$ with ReLU activation. The autoencoder loss forces the decoder to reconstruct features in the VGG space, which in turn restricts all features between the encoder and decoder to lie within this space. From the standpoint of manifold alignment, it thereby retains a common space for aligning the content and style manifolds.

The proposed module is a completely autonomous part of CoStNet. It
comprises a channel alignment module designed to accentuate related content and
style semantics, an attention module facilitating the establishment of feature
correspondences, and a spatial interpolation module aimed at dynamically aligning
the manifold structures.

The channel alignment module utilizes a combination of global average pooling
and a multilayer perceptron (MLP) to embed $F \in \mathbb{R}^{H \times W \times C}$ into
$\mathbb{R}^{C}$ and derive the corresponding channel weights, where $H$, $W$, and $C$
denote the height, width, and number of channels of $F$. These weights, denoted as
$A_c \in \mathbb{R}^{C}$ and $A_s \in \mathbb{R}^{C}$, are computed based on both the content
feature $F_c$ and the style feature $F_s$. Subsequently, the features $F_c$ and $F_s$
undergo cross-weighting with these channel weights, resulting in the aligned features
$\hat{F}_c$ and $\hat{F}_s$. The attention module utilizes 1 × 1 convolutional
blocks for feature embedding, along with mean variance normalization, to compute
the attention map $A_{cs}$, which captures pairwise similarities between features.
Subsequently, the style feature vectors are redistributed based
on the content feature $\hat{F}_c$ according to the computed attention map $A_{cs}$. The
spatial interpolation module synthesizes spatial information for adaptive interpolation
between the content feature $\hat{F}_c$ and the redistributed style feature $\hat{F}_s^{*}$.
Specifically, the dense operation employs multiscale convolution kernels on the
concatenated feature to compute the interpolation weights $G$. By concatenating the
features, local discrepancies between corresponding content and style features are
identified, enabling the determination of appropriate interpolation strengths. Consequently,
the spatial interpolation module effectively merges the most similar content and
style feature vectors, facilitating manifold alignment through linear redistribution
of the style feature and interpolation of its linear components with the content
feature. A visual workflow of the style transfer module components is provided in the
accompanying figure, and a detailed analysis of each component can be found in [195]. The style
transfer module is executed before the training samples are fed into the first stage,
which employs the CNN. It can be executed in real time, depending on whether
additional samples are needed due to a lack of training CG samples.


7.3.3 Supervised contrastive learning 


Self-supervised contrastive learning tries to maximize the similarity of two
normalized vector representations (i.e., embeddings), pulling together the normalized
embeddings that belong to the same class while pushing away the normalized
embeddings that belong to different classes. In [185], label information was leveraged,
and self-supervised contrastive learning was extended to fully supervised
contrastive learning, enabling the consideration of many positives and negatives
per anchor. In contrast, self-supervised learning considers only a single positive
per anchor. Here, the extension in [185] is exploited by applying the SupCon loss
in Eq. (7.9) to the challenging application of distinguishing between CGIs and NIs.

From a practical point of view, contrastive learning embeds data points into a
latent space, where similar instances are brought closer together while dissimilar
instances are pushed apart. Specifically, CoStNet takes advantage of supervised
contrastive learning, where the framework is trained to minimize the contrastive
loss between positive pairs (i.e., samples from the same class, e.g., NIs) and to
maximize the margin between negative pairs (i.e., samples from different classes, such
as CGIs and NIs). This process facilitates the learning of the discriminative features
necessary to accurately distinguish between NIs and CGIs. This capability is
particularly important in real-world forensic applications, where accurate
discrimination is essential for reliable analysis and interpretation.

Following the notation in [185], let us consider a set of $N$ image/label pairs,
$\{x_k, y_k\}_{k=1,\dots,N}$, with their corresponding training batch
$\{\tilde{x}_i, \tilde{y}_i\}_{i=1,\dots,2N}$, where $\tilde{x}_{2k}$ and $\tilde{x}_{2k-1}$
are 2 augmentations of $x_k$, and $\tilde{y}_{2k-1} = \tilde{y}_{2k} = y_k$.

Let $I = \{1, 2, \dots, 2N\}$. For $i \in I$, let $A(i) = I \setminus \{i\}$. If
$\tau \in \mathbb{R}^{+}$ denotes a scalar temperature parameter, define

$$P_{ij} = \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{7.7}$$

In Eq. (7.7), $z_i = \mathrm{Proj}(\mathrm{Enc}(\tilde{x}_i)) \in \mathbb{R}^{D_P}$, where $D_P$ is
the size of a single linear layer, $\mathrm{Enc}(\tilde{x}_i)$ maps $\tilde{x}_i$ to a
representation vector $r_i$, and $\mathrm{Proj}(r_i)$ maps $r_i$ to the vector $z_i$.

Let $i$ be the anchor and $j(i)$ denote the index of the other augmented sample
in the same set, known as the positive. In the self-supervised approach, the loss
function is formulated as [185]:

$$\mathcal{L}^{self} = \sum_{i \in I} \mathcal{L}^{self}_i = -\sum_{i \in I} \log P_{i j(i)} \tag{7.8}$$

The remaining $2(N-1)$ indices in $A(i) \setminus \{j(i)\}$ are called the negatives.
The SupCon loss is a generalization of Eq. (7.8), which leverages the label
information [185]. The SupCon loss is formulated as follows:

$$\mathcal{L}^{sup}_{out} = \sum_{i \in I} \mathcal{L}^{sup}_{out,i} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log P_{ip} \tag{7.9}$$

where $P(i) = \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ denotes the set of indices of all
positives in the set of augmented samples that are distinct from $i$, and $|P(i)|$ stands
for the cardinality of set $P(i)$.
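The SupCon loss can be illustrated with a short plain-Python sketch. It assumes that the embeddings are L2-normalized before the dot products are taken and that the loss is summed (not averaged) over the anchors. The default temperature of 0.1 and all function names are hypothetical choices made for this example only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss summed over all anchors of one batch."""
    z = [l2_normalize(v) for v in embeddings]
    total = 0.0
    for i in range(len(z)):
        others = [a for a in range(len(z)) if a != i]  # the set A(i)
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature for a in others}
        # Log of the softmax denominator, computed stably.
        m = max(logits.values())
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        positives = [p for p in others if labels[p] == labels[i]]  # the set P(i)
        if not positives:
            continue  # an anchor without positives contributes nothing
        total += -sum(logits[p] - log_denominator for p in positives) / len(positives)
    return total
```

With exactly one positive per anchor, the same function reduces to the self-supervised loss, which is the sense in which the supervised loss generalizes it.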
Here, the gradient of the SupCon loss (7.9) with respect to $z_i$ is given by:

$$\frac{\partial \mathcal{L}^{sup}_{out,i}}{\partial z_i} = \frac{1}{\tau} \left\{ \sum_{p \in P(i)} z_p \left( P_{ip} - \frac{1}{|P(i)|} \right) + \sum_{n \in N(i)} z_n P_{in} \right\} \tag{7.10}$$

where $N(i) = \{n \in A(i) : \tilde{y}_n \neq \tilde{y}_i\}$ is the set of indices of all
negatives in the set of the augmented samples. A detailed visual representation of the
second module of the proposed framework is illustrated in Fig. 7.4.

Figure 7.4: Learning procedure of the proposed framework.


7.4 Datasets 


In applications such as the discrimination of CGIs from NIs, which reduce
to a binary classification problem, dataset selection plays a crucial role in the overall
system accuracy, because the network should be trained on data that resemble
real-life scenarios in order to generalize. The need for proper and accurate dataset
selection is becoming more apparent as more efficient and sophisticated methods and
algorithms are released. The most recent deep learning-based methods are required to
handle more complex and challenging datasets, including CGIs and NIs that
are difficult to distinguish with the naked eye. Here, we employ the DSTok [186], the
Rahmouni [183], and the LSCGB [187] datasets, three datasets that are commonly
used in the literature, to assess the performance of the proposed CoStNet. A set of
challenging images of the DSTok dataset is depicted in Fig. 7.5. In addition to the
aforementioned datasets, we describe further ones appearing in the literature.
The benchmark datasets are summarized in Table 7.1.


• DSTok dataset [186]: The DSTok dataset comprises a total of 4,850 CGIs
and 4,850 NIs sourced from the Internet. NIs encompass diverse indoor and
outdoor landscapes captured by various devices, while CGIs exhibit photorealistic
qualities. This collection contains high-resolution images, ranging from
609 × 603 to 3,507 × 2,737 pixels, with significant inter-class diversity. Such
characteristics make the DSTok dataset a pivotal and prominent resource for
research in CG image detection.


• Rahmouni’s dataset [183]: Rahmouni’s dataset consists of 1,800 high-resolution
CGIs of size 1,920 × 1,080 pixels downloaded from the Level-Design
Reference Database [201]. These CGIs were taken from photorealistic
video games (i.e., Uncharted 4, Battlefield Bad Company 2, The Witcher
3, Battlefield 4, and Grand Theft Auto 5). Only these five video
games were deemed to exhibit a sufficient level of photorealism and thus
they were employed. In addition, 1,800 high-resolution NIs with a
size of 4,928 × 3,264 pixels were obtained from the RAISE dataset,
comprising a diverse array of settings, including outdoor and indoor scenes
such as monuments, houses, landscapes, human bodies and faces, and forests.


• LSCGB dataset [187]: It is one of the most recent datasets. Its size is
orders of magnitude larger than that of the preceding datasets. It consists of
71,168 CGIs and 71,168 NIs. It is characterized by high diversity and small
bias regarding the distribution of color, tone, brightness, and saturation.


• He’s dataset [181]: He’s dataset consists of 6,800 CGIs downloaded from
the Internet. The images were created using a variety of rendering software
packages, such as Maya, AutoCAD, etc. Another 6,800 NIs, captured under
various indoor and outdoor circumstances, were included in the dataset.
All images were stored in JPEG format, and their size ranges from
266 × 199 to 2,048 × 3,200 pixels.


• Columbia dataset [203]: The Columbia dataset consists of four sets of 800
images, resulting in a total of 3,200 images. The first set consists of 800 NIs
captured using the professional single-lens reflex cameras Canon 10D and Nikon D70.
These images demonstrate content diversity regarding indoor and outdoor scenes,
various lighting conditions, etc. Another 800 NIs were retrieved from the
Internet using Google Image Search based on keywords matching the CGI
set’s categories. A total of 800 CGIs were downloaded from the Internet.
These images were classified based on their content, such as nature, objects,
architecture, etc., and were created with various rendering software packages.
Another 800 CGIs were recaptured from the monitor while
displaying the previous set of 800 CGIs.


Figure 7.5: Sample images of DSTok dataset. On the left, a natural image is
depicted. On the right, a computer-generated image is shown. It is difficult to
determine that the image on the right is computer-generated with the naked eye.

Table 7.1: Benchmark datasets.

Dataset          # of CGIs   # of NIs   CGI sources                   NI sources                                            Year
DSTok [186]      4,850       4,850      3D models                     Photo-sharing websites                                2013
Rahmouni [183]   1,800       1,800      3D models, games              Existing benchmarks                                   2017
LSCGB [187]      71,168      71,168     Models, games, movies, GANs   Existing benchmarks, movies, photo-sharing websites   2020
He [181]         6,800       6,800      3D models                     Personal collection                                   2018
Columbia [203]   1,600       1,600      3D models                     Personal collection, Google Image Search              2005


7.5 Experimental evaluation 


7.5.1 Experimental setup and augmentations 


CoStNet works effectively in real-life applications. During the first stage, the
network was trained for 100 epochs with a batch size of 200, while the linear
classifier of the second stage was trained for 100 epochs with a batch
size of 20. Stochastic Gradient Descent was employed with learning rates
of 0.1 and 0.01 for the first and second stages, respectively. A cosine annealing
scheduler was employed to adjust the learning rate during training. The maximum
numbers of iterations were set to 100 and 20 and the minimum learning rates were
set to 0.01 and 0.001 for the first and second stages, respectively. In evaluating the
performance of the proposed framework, classification accuracy was utilized as the
primary metric, in accordance with the literature, and measured as:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{7.11}$$

where TP stands for true positives, TN for true negatives, FP for false positives,
and FN for false negatives.

The proposed approach was implemented using the PyTorch 1.7.1 framework
(https://pytorch.org/), and the hardware settings are indicated in Table 7.2.


Table 7.2: Hardware settings. 


Details Configuration 
CPU i9-7900X @ 3.3 GHz 
GPU RTX 2080 Ti 
RAM 126 GB 
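
The learning-rate schedule of the experimental setup can be sketched with the standard closed-form cosine annealing rule. This is an illustration only: it assumes that the scheduler is stepped once per epoch and that the reported "maximum number of iterations" plays the role of the half-period `t_max`. Neither detail is stated in the text, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(step, base_lr, min_lr, t_max):
    """Closed-form cosine annealing from base_lr (step 0) down to min_lr (step t_max)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * step / t_max))


# First stage: learning rate 0.1, minimum 0.01, 100 iterations (values from the setup).
stage_one = [cosine_annealing_lr(e, 0.1, 0.01, 100) for e in range(101)]
# Second stage: learning rate 0.01, minimum 0.001, 20 iterations.
stage_two = [cosine_annealing_lr(e, 0.01, 0.001, 20) for e in range(21)]
```

Under these assumptions the first-stage rate starts at 0.1, passes through the midpoint of the two bounds halfway through, and ends at 0.01.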


Data augmentation positively affects the training procedure and contributes to
the accurate classification of CGIs and NIs. Three CGIs were employed as
reference images for content and style semantics in the style transfer module
preceding the CNN module. All NIs in the training set were passed through
the style transfer module. Consequently, the content manifold of the NIs was aligned to
the style manifold of the CGIs, and a new set of CGIs was created to enhance the
training procedure. Afterwards, a standard series of data augmentation procedures
was applied to the dataset images. The input images were (i) randomly cropped
and resized to 224 × 224 pixels; (ii) randomly rotated; (iii) randomly changed in
brightness, contrast, and saturation; (iv) converted to grayscale with a probability
of 0.2; and (v) normalized so that pixel values lie in [0, 1]. CoStNet was tested on
various benchmark datasets, as described in Section 7.4, and several series of
experiments were conducted, including parameter assessment and generalization
ability (Section 7.5.3), robustness (Section 7.5.4), the impact of the style transfer
module (Section 7.5.5), and statistical significance evaluation (Section 7.5.6).


7.5.2 Evaluation results on the benchmark datasets 


To evaluate CoStNet for differentiating between CGIs and NIs, we initially
employed the public benchmark DSTok [186] dataset. CoStNet was compared against
state-of-the-art methods with respect to classification accuracy. The classification
accuracy of the 14 state-of-the-art methods is that reported in [187]. CoStNet achieved
a classification accuracy of 97.11%, exceeding by 1.01 percentage points (1.05% in
relative terms) the CGNet proposed in [178], which was based on transfer learning and
attained an accuracy of 96.10%.
The method proposed in [179] lagged behind, reaching an accuracy of 95.30%,
followed by the method in [182], which resulted in an accuracy of 95.02%. The accuracy
reported herein was achieved after 100 epochs of training with a batch size equal to 200.
The accuracies of all methods employed are listed in Table 7.3. It is worth mentioning
that the first training stage provided a well-trained CNN yielding a high validation
performance before representations were fed into the linear classifier in the second
training stage. During the first training stage, the training loss decayed rapidly
after approximately 10 epochs, resulting in high accuracy, as demonstrated by the
experiments conducted in Section 7.5.3. The training loss of the CNN and
its validation accuracy are plotted in Fig. 7.6 and Fig. 7.7, respectively.

On the Rahmouni dataset, CoStNet achieved an accuracy of 100.00%, which
positions it as the leading method among the compared approaches. CoStNet surpassed
all other methods, including the closest contender Bai [187], which attained an accuracy of
99.94%, as well as well-established methodologies such as Meena [205], Zhang [180],
Nguyen [206], and Huang [207], among others.

The proposed CoStNet achieves an accuracy of 89.91% on the benchmark
LSCGB dataset. While CoStNet lags behind the method proposed in [187], which holds
the highest accuracy at 91.45%, the difference of 1.54 percentage points is relatively
modest in the broader context of CGI detection. CoStNet also outperforms several other
state-of-the-art methods listed in Table 7.3, which positions it as a competitive option
for practical applications in image forensics on the large-scale LSCGB dataset and
offers practitioners a viable alternative among high-performing algorithms.

The accuracies achieved by CoStNet on the benchmark DSTok and Rahmouni
datasets position it as a cutting-edge solution in image manipulation detection.
While facing a slightly more competitive landscape on the LSCGB dataset, CoStNet
remains at the forefront of advancements in this domain, contributing significantly
to the state-of-the-art methodologies in the field of CGI detection.

Table 7.3: Detection accuracy (%) of state-of-the-art methods on benchmark
datasets. Accuracies marked with † were taken from prior work, while the rest were
obtained from [187].

Algorithms           DSTok [186]   Rahmouni [183]   LSCGB [187]
Rahmouni [183]       75.49 †       85.39            77.45
Quan [190]           93.74 †       90.49            82.80
Yao [189]            93.35 †       92.93            82.91
Gando [208]          85.50 †       -                -
De Rezende [182]     95.02 †       -                -
He [181]             91.58 †       -                -
Quan [179]           95.30 †       -                -
Zhang [180]          91.97 †       99.72            90.42
Chawla [209]         85.11         94.46            77.12
Nguyen [206]         94.42         99.71            90.02
Huang [207]          94.24         99.56            90.18
Meena [205]          93.65         99.70            90.09
Yao [178]            96.10 †       -                -
Bai [187]            96.35         99.94            91.45
CoStNet (Proposed)   97.11         100.00           89.91

Figure 7.6: Training loss versus epochs during the first training stage of CoStNet
on the DSTok dataset.

Figure 7.7: Validation accuracy versus epochs during the first training stage of
CoStNet on the DSTok dataset.


7.5.3 Parameters’ assessment and generalization ability 


The performance of CoStNet depends on various parameters, and an extensive study
was performed to determine which of them affect it. A first series of experiments
was carried out to investigate how the number of epochs affects the detection
results in the two stages. During these experiments, the batch size was fixed to
200. A top accuracy of 97.11% was measured when 100 epochs were employed in
both stages. It is worth noting that an accuracy of 96.63% was achieved after 20
training epochs in both stages, outperforming the method proposed in [178]. It is
also interesting to note that after only 10 training epochs, the proposed approach
attained an accuracy of 95.81%, which, although slightly inferior to that reported
in [178], still ranks as the second-best method. The accuracy results for various
numbers of epochs are listed in Table 7.4.


Table 7.4: Detection accuracy (%) for various numbers of training epochs on the
DSTok dataset.

Epochs         10      20      30      50      60      70      100
Accuracy (%)   95.81   96.63   96.91   96.83   96.89   96.92   97.11

A second set of experiments was conducted to assess the contribution of the
batch size to the classification accuracy. Various batch sizes were tested,
while the number of epochs was kept fixed at 100. For a batch size of 250, the
accuracy remained above 97%. For batch sizes smaller than 250, the accuracy was
above 96%, demonstrating that CoStNet is not severely
affected by the batch size. It is noteworthy that even when a small batch size of 20
samples was employed, an accuracy of 96.75% was measured, still outperforming
the state of the art and demonstrating the classification ability of
CoStNet. The accuracy results for various batch sizes are listed in Table 7.5.

Table 7.5: Detection accuracy (%) for various batch sizes on the DSTok dataset.

Batch size     20      30      50      60      70      100     250
Accuracy (%)   96.75   96.59   96.83   96.56   96.47   96.81   97.03


A very important aspect of the model is its generalization ability,
i.e., the proficiency to accurately classify unfamiliar data derived from a diverse
array of setups. To assess it, the model trained on the DSTok dataset was evaluated
on the well-known Rahmouni dataset, i.e., through cross-dataset testing. Table 7.6
summarizes the accuracies of CoStNet and the competing methods when trained on the
DSTok dataset and tested on Rahmouni’s test set. CoStNet attains an accuracy of
73.67%, ranking third among the compared methods.

The discrepancy between the top ranking on the DSTok dataset and the
third place in cross-dataset testing illuminates a significant
challenge for deep learning methodologies: their susceptibility to dataset variations.
On the DSTok dataset, the proposed method and the majority of the models
surpassed 90% accuracy, whereas the performance on Rahmouni’s dataset is
substantially lower, which underscores the considerable impact of dataset dissimilarities.
Compared to the DSTok dataset, Rahmouni’s dataset has divergent stylistic attributes,
diverse content structures, and potentially distinct contextual elements.
These disparities extend beyond quantitative differences
and significantly affect model adaptability when confronting unfamiliar data
distributions.

The principal issue lies in the difficulty the models have in generalizing effectively
across dissimilar datasets. Despite commendable performance on the familiar
DSTok training data, the noticeable deterioration in detection
accuracy across all models in the cross-dataset assessment highlights the substantial
divergence between the datasets and emphasizes the critical need to fortify models
against such variations.


Table 7.6: Detection accuracy (%) in cross-dataset testing. State-of-the-art methods
and the proposed CoStNet are trained on the DSTok dataset and tested on
Rahmouni’s dataset.

Algorithms           Rahmouni’s dataset
Rahmouni [183]       60.85
Quan [190]           56.43
Yao [189]            78.37
Gando [208]          67.48
De Rezende [182]     73.00
He [181]             56.78
Zhang [180]          61.78
Quan [179]           59.36
Yao [178]            82.41
CoStNet (Proposed)   73.67


To further examine CoStNet’s generalization capabilities, the model
was trained on the complex LSCGB dataset and subsequently evaluated
through cross-dataset testing on the DSTok dataset. As shown in Table 7.7, CoStNet
surpasses established state-of-the-art methods by achieving a detection accuracy of
93.03%. This denotes a substantial advancement compared to the prior assessment
on Rahmouni’s dataset, affirming CoStNet’s adaptability and resilience to diverse
and challenging data distributions.

Compared to the highest-performing state-of-the-art
algorithm, Bai [187], which attains an accuracy of 83.95%, CoStNet improves by
9.08 percentage points (approximately 10.82% in relative terms). Furthermore, relative
to the mean accuracy of the baseline methods (approximately 78.63%), CoStNet exhibits
an increase of approximately 18.31%. These results underscore
the efficacy of CoStNet in surpassing established algorithms and its
proficiency in handling the cross-dataset challenges posed by the DSTok dataset.

CoStNet’s robust generalization is evident when it is trained on challenging datasets,
such as the LSCGB dataset. Exposure to increased complexity and diverse data
modalities enhances the model’s adaptability. This observation highlights
CoStNet’s capacity to discern intricate patterns, facilitating generalization
across diverse contexts. Systematic training on challenging datasets plays
a crucial role in fortifying the model against overfitting and enabling it to capture
underlying structures that transcend dataset-specific nuances. It also
emphasizes the pragmatic utility of subjecting deep learning models to
progressively complex training scenarios for enhanced real-world applicability.


Table 7.7: Detection accuracy (%) in cross-dataset testing. State-of-the-art methods
and the proposed CoStNet are trained on the LSCGB dataset and tested on
the DSTok one.

Algorithms           DSTok dataset
Nguyen [206]         78.71
Huang [207]          80.78
Zhang [180]          72.57
VGG-19 [144]         77.16
Bai [187]            83.95
CoStNet (Proposed)   93.03
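
The summary figures quoted for this cross-dataset experiment can be rechecked directly from the accuracies in Table 7.7. The short script below is only a worked arithmetic example; the variable names are hypothetical and the numbers are copied from the table.

```python
# Cross-dataset accuracies (%) copied from Table 7.7 (trained on LSCGB, tested on DSTok).
baselines = {
    "Nguyen [206]": 78.71,
    "Huang [207]": 80.78,
    "Zhang [180]": 72.57,
    "VGG-19 [144]": 77.16,
    "Bai [187]": 83.95,
}
costnet = 93.03

# Mean accuracy of the baseline methods.
mean_baseline = sum(baselines.values()) / len(baselines)
# Relative increase of CoStNet over that mean, in percent.
relative_gain_over_mean = 100.0 * (costnet - mean_baseline) / mean_baseline
# Absolute margin, in percentage points, over the strongest baseline.
best_baseline = max(baselines, key=baselines.get)
margin_over_best = costnet - baselines[best_baseline]

print(f"mean baseline: {mean_baseline:.2f}%")
print(f"relative gain over mean: {relative_gain_over_mean:.2f}%")
print(f"margin over {best_baseline}: {margin_over_best:.2f} percentage points")
```

Keeping absolute differences (percentage points) separate from relative gains (percent of the reference accuracy) avoids mixing the two kinds of figures when methods are compared.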


7.5.4 Robustness capability 


In real-world digital forensics, an essential criterion for a comprehensive
system is robustness against a range of noise types
and levels. To assess CoStNet in this respect, we conducted a
set of experiments aligned with the established literature, which
facilitates a direct comparison with existing methods under diverse conditions.

In the first experimental set, salt-and-pepper noise was added to
the test samples of the DSTok dataset, following the methodology outlined in
[178], at signal-to-noise ratios (SNRs) of 0.99, 0.95, and
0.9. Table 7.8 provides a quantitative overview of the outcomes, and
an example of injected noise is depicted in Fig. 7.8. When the test samples
were corrupted with salt-and-pepper noise at SNR = 0.99, the proposed approach
yielded an accuracy of 95.20%, while the method proposed in [178] lagged
behind with an accuracy of 93.08%. When the SNR of
the injected noise was decreased to 0.95, CoStNet achieved an accuracy of 92.97%,
outperforming its competitors. Even in the most challenging condition of SNR =
0.9, CoStNet maintained an accuracy of 90.23%, notably exceeding the second-best
method’s accuracy of 82.38% [178]. It is worth mentioning that 6 out of 10
approaches achieved an accuracy of about 50%, although most of them exceeded 90%
in the original experiments. For example, the detection accuracy of the Quan [179]
algorithm dropped from 95.30% to 55.58% after exposure to
salt-and-pepper noise, indicating limited robustness to perturbations that mirror
real-world scenarios. Similar
behavior was also noticed in Rahmouni [183], Yao [189], He [181], Zhang [180], and
Quan [190]. In contrast, CoStNet remains accurate under noise interference,
underscoring its potential under demanding real-world conditions.


Table 7.8: Detection accuracy (%) after salt-and-pepper noise attack on the test
images.

Algorithms           SNR = 0.99   SNR = 0.95   SNR = 0.9
Rahmouni [183]       52.59        51.36        50.73
Quan [190]           50.27        50.02        49.99
Yao [189]            47.96        45.44        50.00
Gando [208]          79.01        70.52        64.53
De Rezende [182]     92.19        86.63        80.55
He [181]             50.18        50.01        50.05
Zhang [180]          57.50        52.67        51.93
Quan [179]           59.58        48.79        49.35
Yao [178]            93.08        88.91        82.38
CoStNet (Proposed)   95.20        92.97        90.23

Figure 7.8: Original CGI and the CGIs altered by injecting salt-and-pepper noise
at various SNRs. On the top left, the original CGI is depicted. On the top right,
the CGI with an SNR = 0.99 is illustrated. On the bottom left, the CGI is depicted
with an SNR = 0.95. On the bottom right, the CGI with an SNR = 0.9 is illustrated.
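
A minimal sketch of a salt-and-pepper attack on a grayscale image is given below. The text does not define how the SNR parametrizes the noise, so the sketch assumes that the SNR is the fraction of pixels left untouched (SNR = 0.99 corrupts 1% of the pixels) and that corrupted pixels become white or black with equal probability. The function name, the seed argument, and the list-of-rows image format are hypothetical.

```python
import random


def add_salt_and_pepper(image, snr, seed=0):
    """Return a copy of `image` (rows of 0..255 ints) with salt-and-pepper noise.

    `snr` is assumed to be the fraction of pixels that stay unaltered.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    n_noisy = round((1.0 - snr) * height * width)
    noisy = [row[:] for row in image]  # leave the input image untouched
    for index in rng.sample(range(height * width), n_noisy):
        r, c = divmod(index, width)
        noisy[r][c] = 255 if rng.random() < 0.5 else 0  # salt or pepper
    return noisy
```

For a 100 × 100 image, the three noise levels of Table 7.8 correspond under this assumption to 100, 500, and 1,000 corrupted pixels, respectively.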


A subsequent series of experiments involved the introduction
of Gaussian noise. Following the methodology outlined in [178],
the mean of the Gaussian noise was set to 0 while maintaining a signal-to-noise
ratio (SNR) of 0.7. Three distinct standard deviations (SD) of the noise were
explored, namely 10, 30, and 50.
The detection accuracies are presented in Table 7.9 and show the
models’ performance across varying degrees of Gaussian perturbation. When
SD = 10, the proposed approach yielded an accuracy of 66.13%, ranking
fifth. The top-performing approach was the
method proposed in [182]. When the SD was increased to 30, CoStNet ranked
fourth out of the ten methods with an accuracy of 62.03%.
When the SD was increased to 50, CoStNet attained an accuracy of
60.97%, ranking third. Hence, the greater the SD of the noise, the better the
ranking of the proposed method. At SD = 10, De Rezende’s approach [182] attains
a high detection accuracy of
96.63%, demonstrating its robustness to this kind of attack at low noise levels. We
argue that this stems from the preprocessing utilized by this method, which
subtracts the mean RGB value of the ImageNet
dataset from each pixel. As the SD increases, the detection accuracy
deteriorates notably, demonstrating that this form of attack
profoundly impacts the overall performance of the models.

The relative deteriorations in accuracy across varying levels of Gaussian noise
reveal notable trends among the evaluated methods. De Rezende’s approach
showed considerable vulnerability, with a 19.28% deterioration from SD
= 10 to SD = 30 and a total 30.21% decrease from SD = 10 to SD = 50. Similarly,
the Yao [178] method exhibited significant susceptibility, with deteriorations
of 9.61% from SD = 10 to SD = 30 and 18.50% from SD = 10 to SD = 50. In contrast,
the proposed CoStNet demonstrated relatively better robustness,
with deteriorations of 6.20% from SD = 10 to SD = 30 and 7.80% from SD
= 10 to SD = 50. These observations underline the varying degrees of resilience
among the evaluated algorithms against escalating levels of Gaussian noise, with
De Rezende displaying the most pronounced sensitivity and CoStNet showing
relatively improved stability in the face of increasing noise levels.


Table 7.9: Detection accuracy (%) after Gaussian noise attack on test images.

Algorithms           SD = 10   SD = 30   SD = 50
Rahmouni [183]       52.19     50.00     50.25
Quan [190]           52.95     49.66     48.91
Yao [189]            44.31     41.23     50.00
Gando [208]          75.00     65.08     57.50
De Rezende [182]     96.63     78.00     67.44
He [181]             72.38     57.47     54.41
Zhang [180]          54.64     50.30     49.12
Quan [179]           51.39     50.24     49.12
Yao [178]            88.54     80.03     72.16
CoStNet (Proposed)   66.13     62.03     60.97
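
The relative deteriorations discussed for the Gaussian noise attack follow from Table 7.9 as the accuracy lost relative to the SD = 10 accuracy. The snippet below is a worked arithmetic example for three of the methods; the names are hypothetical and the numbers are copied from the table.

```python
# Accuracies (%) at SD = 10, 30, and 50, copied from Table 7.9.
table_7_9 = {
    "De Rezende [182]": (96.63, 78.00, 67.44),
    "Yao [178]": (88.54, 80.03, 72.16),
    "CoStNet": (66.13, 62.03, 60.97),
}


def relative_drop(reference, value):
    """Accuracy lost relative to the reference accuracy, in percent."""
    return 100.0 * (reference - value) / reference


# For each method: (drop from SD = 10 to SD = 30, drop from SD = 10 to SD = 50).
drops = {
    name: (relative_drop(sd10, sd30), relative_drop(sd10, sd50))
    for name, (sd10, sd30, sd50) in table_7_9.items()
}

for name, (to_30, to_50) in drops.items():
    print(f"{name}: {to_30:.2f}% (SD 10 -> 30), {to_50:.2f}% (SD 10 -> 50)")
```

Because the drop is expressed relative to each method's own starting accuracy, a method with a lower absolute accuracy can still show the smallest relative deterioration.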


7.5.5 Impact of style transfer (Ablation study) 


The proposed CoStNet benefits from the style transfer module, which acts in a 
complementary manner, enriching the training procedure with additional training 
CG samples. The research question that arises refers to the contribution of the 
style transfer module in cases of reduced training samples. Four different experi- 
ments were conducted in which the style transfer module had different quantitative 
contributions to training samples, as depicted in Fig. Specifically, 75%, 50%, 
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25%, and 10% of the initial CG training samples were randomly removed and re- 
placed with the same percentages using the style transfer module in the DSTok 
dataset such that the original number of training samples remains unchanged. 
The best accuracy was observed when the style transfer module replaced 10% of 
the training samples with CGIs, reaching an accuracy of 97.09%, outperforming 
the CGNet [178], which derived an accuracy of 96.10%. When 25% of the origi- 
nal CG training samples were removed and replaced by the style module, CoStNet 
achieved an accuracy of 96.56%. When half of the training samples were randomly 
removed and replaced by the style transfer module, CoStNet reached an accuracy 
of 95.75%, ranking second. Finally, when only
25% of the original training samples were retained and the style transfer module 
completed the rest of the samples, CoStNet reached an accuracy of 94.72%. The 
results of the ablation study demonstrate the significant contribution of the style 
transfer module in achieving improved accuracy with reduced training samples. 
This finding supports the claim that incorporating the style transfer module into 
the proposed architecture can lead to more effective and accurate predictions. Such 
insights provide valuable guidance for future research in this area and suggest that 
the proposed approach has the potential to enhance the performance of a wide 
range of machine learning applications in the context of discriminating CGIs from
NIs.

Figure 7.9: Accuracy of the proposed CoStNet on the DSTok dataset when several
portions of the original dataset are retained.
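
The replacement protocol of the ablation study can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: `generate` is a hypothetical callable standing in for the style transfer module, and the samples are represented by placeholder strings.

```python
import random


def replace_with_synthetic(cg_samples, fraction, generate, seed=0):
    """Randomly drop a fraction of CG samples and refill with synthetic ones.

    `generate` is a hypothetical stand-in for the style transfer module;
    each call is assumed to return one synthetic CG training sample.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_replace = round(fraction * len(cg_samples))
    removed = set(rng.sample(range(len(cg_samples)), n_replace))
    kept = [s for i, s in enumerate(cg_samples) if i not in removed]
    synthetic = [generate() for _ in range(n_replace)]
    # The total number of training samples is left unchanged.
    return kept + synthetic


original = [f"cg_{i}" for i in range(200)]  # placeholder CG samples
for fraction in (0.10, 0.25, 0.50, 0.75):
    mixed = replace_with_synthetic(original, fraction, lambda: "stylized")
    print(fraction, len(mixed), mixed.count("stylized"))
```

Keeping the set size fixed isolates the effect of the sample source (original versus style-transferred) from the effect of the training set size.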


7.5.6 Statistical significance 


It is worth examining the accuracy differences between the proposed method and
the state-of-the-art method of [178] on the DSTok dataset. A difference of 1.01%
is observed when the style transfer module is used to augment the set of training
samples, as described in Section 7.5.2. A comparable difference of 0.99% is
observed when the style transfer module replaces 10% of the CGIs in the training
set, a setting chosen to match the number of training samples used in [178]. The
closeness of the two differences indicates that the proposed method behaves
consistently across the two ways of incorporating supplementary data.

To check whether the accuracy difference of 0.99% is statistically significant, the
approximate analysis in [210] is applied. The accuracies $\varpi_1$ and $\varpi_2$ are
binomially distributed random variables. If $\hat{\varpi}_1$, $\hat{\varpi}_2$ denote the
empirical accuracies, and $\bar{\varpi} = \frac{\hat{\varpi}_1 + \hat{\varpi}_2}{2}$, the
hypothesis $H_0: \varpi_1 = \varpi_2 = \bar{\varpi}$ is tested at the 95% level of significance.
The accuracy difference has a variance of $\beta = \frac{2\bar{\varpi}(1 - \bar{\varpi})}{N}$,
where $N$ is the number of images. For $\zeta = 1.65\sqrt{\beta}$, if
$|\hat{\varpi}_1 - \hat{\varpi}_2| > \zeta$, $H_0$ is rejected with a risk of 5% of
being wrong. In that case, there is sufficient evidence to warrant the rejection of the
claim that both the CGNet and CoStNet methods attain the same accuracy.
Accordingly, in 95% of repetitions of the experiment, CoStNet is expected to
outperform CGNet [178]. In our case, the obtained $\zeta$ = 0.24% indicates that the
observed 0.99% accuracy difference between the proposed framework and the
state-of-the-art CGNet reported in [178] is statistically significant.
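
The test described above can be sketched numerically. The snippet below is a minimal illustration, assuming that the variance of the accuracy difference takes the standard form 2p(1 - p)/N for two binomial proportions with pooled accuracy p, and that the one-sided critical value is 1.65. The function name and the test-set size used in the example are hypothetical, since the number of test images is not restated in this section.

```python
import math


def accuracy_difference_test(acc1, acc2, n_images, z=1.65):
    """Approximate test of H0: both classifiers have the same accuracy.

    Accuracies are fractions in [0, 1]. Returns the rejection threshold
    and a flag telling whether H0 is rejected at the chosen level.
    """
    pooled = 0.5 * (acc1 + acc2)                       # pooled accuracy under H0
    variance = 2.0 * pooled * (1.0 - pooled) / n_images
    threshold = z * math.sqrt(variance)
    return threshold, abs(acc1 - acc2) > threshold


# Accuracies from the ablation study; n_images is a hypothetical value.
threshold, rejected = accuracy_difference_test(0.9709, 0.9610, n_images=10_000)
print(f"threshold = {100 * threshold:.2f}%, reject H0: {rejected}")
```

Because the threshold shrinks with the square root of the number of test images, the same accuracy gap may or may not be significant depending on the size of the test set.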

The same procedure was employed to check whether the accuracy difference
of 0.76% between the proposed CoStNet and the method presented in [187], when
both were trained and tested on the DSTok dataset, is statistically significant.
In that case, the obtained threshold $\zeta$ = 0.23% indicates that the observed
0.76% accuracy difference between the proposed framework and the method reported
in [187] is statistically significant. This analysis provides strong evidence that the
performance enhancements achieved by our method over the state-of-the-art CGNet [178]
and the method presented in [187] are not merely incidental, but rather statistically
validated. This statistical rigor not only complements the empirical observations
but also reinforces the credibility of claims regarding the method’s effectiveness,
lending support to the notion of the robustness and reliability of the proposed
methodology.

This doctoral thesis has delved into various aspects of multimedia authentication
and information evaluation, conducting a comprehensive examination and analy- 
sis. The research work has encompassed the development of multiple approaches 
for assessing the veracity and authenticity of information, as well as for processing 
and recommending multimodal information. The investigation has been focused 
on three key areas: 1) leveraging ENF for multimedia authentication and integrity 
verification, 2) employing hypergraph learning for faithful personalized image rec- 
ommendation, and 3) distinguishing between NIs and CGIs. The proposed ap- 
proaches have been built upon foundational principles such as spectral analysis, 
hypergraph learning with optimization theory, and deep neural networks, which 
form the fundamental framework for the research endeavours. 

Specifically, in Chapter 2, an approach based on the design of a lag window for the
Blackman-Tukey estimator has been proposed to suppress the strong interference 
present in audio recordings that hinders ENF estimation. The implementation 
of a lag window design ensures an optimal balance between minimizing smearing 
and leakage effects, thereby providing a robust estimation of ENF under diverse 
SNR conditions. The proposed methodology has undergone rigorous testing on au- 
thentic datasets, and its performance was compared against state-of-the-art ENF 
estimation techniques. Through the utilization of a frame-based approach, mul- 
tiple frequency estimation methods have been evaluated on the entire sequence, 
divided into consecutive overlapping frames. Experimental results have demon- 
strated that when the raw datasets undergo filtering using a meticulously designed 
band-pass filter, considering factors such as band-pass edges and filter order based 
on the recording’s characteristics, both non-parametric and parametric spectral 
estimation techniques deliver highly accurate ENF estimations. However, certain 
challenges arise, as demonstrated by experiments that have been conducted on 
mixed recordings, highlighting the non-trivial nature of selecting an appropriate 
spectral estimation method. The results have demonstrated the superiority of the
proposed approach over numerous competing methods, highlighting its potential
in the field of ENF estimation. Furthermore, statistical tests have validated the 
significance of the observed performance improvements. Future research endeav- 
ours may focus on subjecting the proposed approach to more demanding scenarios, 
such as extracting ENF from both images and videos. This exploration could shed 
light on the efficacy and adaptability of the approach in diverse contexts, paving 
the way for further advancements in the field. 

In Chapter 3, a novel ENF estimation approach based on the filter-bank Capon
method, incorporating temporal windowing, has been introduced. By leveraging
the Toeplitz structure of covariance matrices and exploiting Krylov matrices, a 
fast and efficient method has been developed, demonstrating superior accuracy 
compared to existing techniques in power recordings. Remarkably, this approach 
maintains its speed and effectiveness even when utilizing very short frame lengths. 
Consequently, it offers an excellent balance between speed and accuracy, a crucial 
aspect in forensic applications. Furthermore, extensive experiments have high- 
lighted the non-trivial nature of window selection, with methods like the STFT 
proving capable of achieving excellent results, even with very short frame lengths. 
More specifically, with a proper window selection, i.e., a Parzen one of 1-sec du- 
ration, a trivial method, such as the STFT, outperforms state-of-the-art methods 
demonstrating high efficiency in ENF estimation. Future research could inves- 
tigate the feasibility of implementing the proposed approach for on-the-fly ENF 
estimation from individual images in real-world forensic applications. 
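
As an illustration of the window selection remark, a frame-based STFT-style estimator with a 1-sec Parzen window can be sketched in a few lines. This is a simplified sketch and not the implementation evaluated in Chapter 3: the nominal frequency of 50 Hz, the 0.5 Hz search band, the hop size, the sampling rate, and all function names are assumptions made for the example, and the peak is located by evaluating the windowed Fourier transform on a dense frequency grid.

```python
import cmath
import math


def parzen_window(length):
    """Parzen window of the given length (piecewise cubic taper)."""
    half = length / 2.0
    center = (length - 1) / 2.0
    window = []
    for n in range(length):
        u = abs(n - center) / half
        if u <= 0.5:
            window.append(1.0 - 6.0 * u**2 + 6.0 * u**3)
        else:
            window.append(2.0 * (1.0 - u) ** 3)
    return window


def frame_peak_frequency(frame, window, fs, f_lo, f_hi, step=0.01):
    """Frequency (Hz) maximizing the windowed spectrum magnitude on a grid."""
    best_f, best_mag = f_lo, -1.0
    n_grid = int(round((f_hi - f_lo) / step)) + 1
    for k in range(n_grid):
        f = f_lo + k * step
        w = -2j * math.pi * f / fs
        mag = abs(sum(x * h * cmath.exp(w * n)
                      for n, (x, h) in enumerate(zip(frame, window))))
        if mag > best_mag:
            best_f, best_mag = f, mag
    return best_f


def stft_enf(signal, fs, frame_sec=1.0, hop_sec=0.5, nominal=50.0, half_band=0.5):
    """One ENF estimate per overlapping frame (assumed nominal frequency)."""
    frame_len = int(round(frame_sec * fs))
    hop = int(round(hop_sec * fs))
    window = parzen_window(frame_len)
    return [
        frame_peak_frequency(signal[s:s + frame_len], window, fs,
                             nominal - half_band, nominal + half_band)
        for s in range(0, len(signal) - frame_len + 1, hop)
    ]


fs = 400  # hypothetical sampling rate in Hz
tone = [math.cos(2 * math.pi * 50.02 * n / fs) for n in range(3 * fs)]
print(stft_enf(tone, fs))
```

The Parzen window trades a wider main lobe for very low side lobes, which is one plausible reason why it behaves well on short frames in the presence of interference.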

In Chapter 4, an innovative automated approach for ENF estimation in static
and non-static videos captured using CCD sensors has been introduced. The ap- 
proach leverages the SLIC algorithm to generate regions with similar characteris- 
tics, where ENF variations can be precisely revealed. Moreover, we have explored 
multiple videos recorded by a fixed camera. A scenario with a moving camera 
would possibly raise additional difficulties in finding areas of similar characteristics,
which are employed in the proposed approach. Consequently, difficulties in
accurately estimating the ENF would be anticipated. In addition, although
the recordings were of escalating difficulty, there was no more than one
person present in the scene. Experimental results have demonstrated that the 
proposed approach, employing either STFT or ESPRIT methods, outperforms the 
state-of-the-art in ENF estimation, as measured by the maximum correlation coef- 
ficient. Additionally, the study has examined the impact of segment duration and 
bandpass filter order on ENF estimation accuracy. Statistical tests have confirmed 
the statistical significance of the performance improvements achieved by the pro- 
posed approach compared to the state-of-the-art approach utilizing the MUSIC 
method. Future work will aim to expand upon the current research by considering
recordings captured using the rolling shutter mechanism of CMOS cameras.
Additionally, we are interested in investigating ENF estimation in scenarios where
non-static cameras are utilized, which is prevalent in real-life applications such 
as mobile phone recordings. Another challenging research direction involves ENF 
estimation when multiple persons are recorded in a video. These research avenues 
pose unique challenges and require the development of novel techniques to accu- 
rately estimate ENF in these complex scenarios. Addressing these challenges will 
contribute to a more comprehensive understanding of ENF estimation in diverse 
recording conditions and pave the way for practical applications in various fields. 

In Chapter 5, an innovative approach for hypergraph learning has been proposed,
aimed at enhancing image and tag recommendation systems. This novel
method amalgamates hypergraph topology learning, hyperedge weight updates, 
and hypergraph ranking into a comprehensive multi-stage optimization scheme. To 
procure optimal recommendation precision, an exhaustive investigation of various 
values for the parameter a has been undertaken during the hypergraph topology 
learning stage. In addition to this, CNN features have been employed as visual 
attributes to augment the semantic image annotation, consequently boosting the 
overall system efficiency. This methodology has been thoroughly assessed utilizing 
a dataset comprising images of Greek POIs derived from Flickr, in conjunction with 
the NUS-WIDE-LITE dataset. The empirical results have demonstrated that the 
proposed methods, namely HMSO and CSL, outperformed contemporary methods
in hypergraph ranking for image and tag recommendation. Moreover, an LMS ap- 
proach for adapting hyperedge weights has been formulated and evaluated against 
the conventional closed expression method for hyperedge weight adaptation. The 
evaluation was conducted on a subset of the image recommendation dataset intro- 
duced in this study. Given the demonstrated efficacy of the proposed methods, it is 
compelling to apply these approaches to more intricate video datasets, potentially 
leveraging keyframes for further analysis. 

In Chapter 6, two distinct approaches for optimizing the ranking vector within
hypergraph learning have been introduced. These approaches, namely block randomized
SVD for matrix inversion and conjugate gradient for solving a set of linear
equations, have been specifically designed to act in a complementary manner in optimizing the
objective function while maintaining fixed hyperedge weights. Experimental re- 
sults have demonstrated that both approaches achieve computational time savings 
without compromising the F measure across multiple ranking positions. Notably, 
employing the conjugate gradient method in ITH and ITH-HWEG has exhibited 
the most significant improvements. These findings are highly encouraging and 
provide a strong impetus for further investigation, particularly in accommodating 
larger image datasets beyond the scope of the current study. 
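
To make the conjugate gradient option concrete, the sketch below solves a small symmetric positive definite system A f = y without forming the inverse of A. It is a generic textbook implementation given under stated assumptions: the 3x3 matrix and the right-hand side are toy values, and the actual system matrix arising from the hypergraph ranking objective is not reproduced here.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the start x = 0
    p = list(r)                      # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # stop once the residual norm is small
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy symmetric positive definite system (hypothetical values).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
y = [1.0, 2.0, 3.0]


def apply_A(v):
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]


f = conjugate_gradient(apply_A, y)
print(f)
```

Only matrix-vector products are required, so a sparse or structured matrix never has to be inverted or even stored densely, which is where the computational savings of such a solver typically come from.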

In Chapter 7, an end-to-end deep learning framework, denoted as CoStNet,
has been introduced as a novel solution for the application of distinguishing NIs
from CGIs. The innovation combines the principles of supervised contrastive
learning, arbitrary style transfer, and the ResNet-18 architecture within a unique 
two-module framework. Through the integration of contrastive learning, CoStNet 
circumvents the necessity for hand-engineered features and adeptly captures intri- 
cate feature representations inherent in the training data, thereby enabling precise 
classification. Notably, the incorporation of the style transfer module extends the 
efficacy of training by enriching the dataset with an amplified array of negative 
samples beyond the confines of the original dataset. The robustness and efficacy of 
CoStNet are substantiated through a comprehensive series of experiments, lever- 
aging the benchmark DSTok, Rahmouni, and LSCGB datasets. Furthermore, its 
prowess is evaluated in terms of both its generalization capacity and resilience 
through cross-dataset testing. An in-depth ablation study elucidates the pivotal 
role played by the style transfer module, particularly in scenarios with constrained 
training data availability. Moreover, statistical tests confirm the
significance of the performance enhancements achieved by CoStNet. The
proposed framework’s efficacy and efficiency provide a roadmap for further research
directions. This could involve the development of more robust CNN architectures 
tailored to handle even more diverse datasets and noise conditions. Advanced noise 
reduction techniques or regularization methods could be explored to improve the 
model’s resilience to Gaussian noise. Additionally, investigating transfer learning 
strategies may enhance model generalization across different datasets, ultimately 
advancing the framework’s applicability in real-world scenarios.